The environmental sciences are witnessing a data revolution as large amounts of data are being made available at an increasing rate. Many datasets are being published through operational monitoring programs, research activities and global earth observation virtual laboratories. An important aspect is the ability to query relevant metadata which can potentially provide useful information to discover, access and interpret environmental datasets, information about the data providers themselves, data services, data encodings, observation and measurement properties and data service endpoints. However, support for producing and accessing metadata descriptions in a flexible, extensible, easily integrated and easily discovered manner is lacking as current methods require interpreting multiple standards and formalisms. In this paper, we propose components to streamline discovery and access of hydrological and environmental data: a Data Provider Node ontology (DPN-O) which allows precise descriptions to be captured about datasets, data services and their interfaces; and a Data Brokering Layer which provides an Application Programming Interface (API) for registering metadata for discovery and query of registered DPN datasets. We discuss this work in the context of the eReefs project which is developing an integrated information platform for discovery and visualization of observational and modelled data of the Great Barrier Reef.
INTRODUCTION
The eReefs project is developing an interoperable coastal information platform (IP) for the conservation of the Great Barrier Reef (Car 2013; Car & Hodge 2013) to meet a number of use cases specified in Car & Murray (2013) (see Table 1). This IP aims to provide integrated discovery, query and visualization across a number of datasets accessed via multiple scientific repositories in multiple agencies working across multiple domains. Besides implementation of the IP, best practice methodologies will be adopted to streamline data acquisition and integration into the eReefs IP. End users, such as researchers and client applications, will be able to access data and metadata through the platform in the same manner regardless of data source, while minimal customization of existing infrastructure will be required of data providers.
Title | Category |
---|---|
Discover data | |
Access data | |
Add new data to a service | |
Add a new service | |
List datasets not compliant with the eReefs IP data model | |
Link vocabulary terms to external vocabularies | |
List all services not responding in a timely manner | |
A harmonized Water Quality vocabulary has been developed and published at http://environment.data.gov.au/def/ (Simons et al. 2013a, 2013b; Cox et al. 2014a). This vocabulary harmonized a number of controlled vocabularies in the Australian context with the Chemical Entities of Biological Interest (CHEBI) ontology and Quantities, Units, Dimensions and Types (QUDT) ontology and is encoded as a Simple Knowledge Organization System (SKOS) vocabulary. It has been developed to support the eReefs IP and is published using the Spatial Information Services Stack Vocabulary Service (SISSVoc) (Cox et al. 2014b). The SISSVoc vocabulary search tool is used to help data providers and end users discover appropriate vocabulary definitions (see http://sissvoc.ereefs.info/search/).
In Yu et al. (2014), a methodology was proposed for enhancing data services using the water quality vocabularies. The methodology outlines how existing data services publishing netCDF-CF (http://cfconventions.org/, last accessed 31 March 2015) and WaterML2.0 (Taylor 2012; Taylor et al. 2013) can be enhanced with links to the water quality vocabulary using a ‘Linked Data’ approach and standardized web vocabularies. An example of the use of this methodology for enhancing netCDF metadata with links to the water quality vocabulary is given in the listing below. In the example below, the original netCDF metadata definition ‘Nap_MIM’, i.e. the first six lines, is annotated with links to related water quality vocabulary definitions in the subsequent five lines, which specify links to specific definitions regarding the observed scaled quantity, the unit of measure, the substance or taxon being observed, the medium of the observation and the procedure used to obtain the observation. Each of these links can then be resolved to semantic descriptions for those definitions which give more precise detail and links to other concepts. This provides a flexible and lightweight annotation method and resolution to more semantics than the original label in the netCDF header ‘TSS, MIM SVDC on Rrs’.
float Nap_MIM(time, latitude, longitude) ;
Nap_MIM:_FillValue = -999.f ;
Nap_MIM:long_name = "TSS, MIM SVDC on Rrs" ;
Nap_MIM:units = "mg/L" ;
Nap_MIM:valid_min = 0.01209607f ;
Nap_MIM:valid_max = 226.9626f ;
Nap_MIM:scaledQuantityKind_id = "http://environment.data.gov.au/water/quality/def/property/solids-total_suspended" ;
Nap_MIM:unit_id = "http://environment.data.gov.au/water/quality/def/unit/MilliGramsPerLitre" ;
Nap_MIM:substanceOrTaxon_id = "http://environment.data.gov.au/water/quality/def/object/solids" ;
Nap_MIM:medium_id = "http://environment.data.gov.au/water/quality/def/object/ocean" ;
Nap_MIM:procedure_id = "http://data.ereefs.org.au/ocean-colour/MIM_SVDC_RRS" ;
These annotations use domain definitions from the standardized web vocabularies to give existing datasets consistent semantics. This in turn allows data to be more easily discovered, accessed, integrated and analysed. This approach is similar to general ‘Linked Data’ methodologies where resources are linked to other resources using the Resource Description Framework (RDF) and where both resource and resource relationship types are found in standardized web vocabularies. It is not currently possible to include RDF metadata in netCDF headers due to netCDF formatting restrictions. However, netCDF-LD, an extension to the current netCDF standard (www.unidata.ucar.edu/software/netcdf/, last accessed 31 March 2015), has also been proposed, which will further allow the semantics of netCDF metadata to be linked with domain semantics and other web resources using semantic web technologies such as RDF (Yu et al. 2015). More work is required to implement it through the netCDF standard formats.
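As a rough illustration of how a client might read these annotations, the sketch below pulls the `_id` attributes out of header text written like the Nap_MIM listing above. The function name `extract_vocab_links` and the regular expression are assumptions made for this example; they are not part of the eReefs tooling or of any netCDF library.

```python
import re

# Header text in the same form as the Nap_MIM listing; only two of the
# annotation lines are reproduced to keep the example short.
CDL_HEADER = '''
float Nap_MIM(time, latitude, longitude) ;
Nap_MIM:long_name = "TSS, MIM SVDC on Rrs" ;
Nap_MIM:units = "mg/L" ;
Nap_MIM:unit_id = "http://environment.data.gov.au/water/quality/def/unit/MilliGramsPerLitre" ;
Nap_MIM:medium_id = "http://environment.data.gov.au/water/quality/def/object/ocean" ;
'''

# Matches attribute lines whose name ends in "_id", e.g.
#   Nap_MIM:unit_id = "http://..." ;
_ID_ATTRIBUTE = re.compile(
    r'^[ \t]*(\w+):(\w+_id)[ \t]*=[ \t]*"([^"]+)"[ \t]*;?[ \t]*$',
    re.MULTILINE,
)


def extract_vocab_links(cdl_text):
    """Return {variable: {attribute: uri}} for every '*_id' attribute found."""
    links = {}
    for variable, attribute, uri in _ID_ATTRIBUTE.findall(cdl_text):
        links.setdefault(variable, {})[attribute] = uri
    return links


links = extract_vocab_links(CDL_HEADER)
```

Each URI collected this way could then be dereferenced to obtain the fuller semantic description; that step is left out here because it depends on the vocabulary service being reachable.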
The eReefs Data Brokering Layer (DBL), previously introduced in Car et al. (2014), uses the brokering pattern to provide an Application Programming Interface (API) for querying, filtering and facilitating access to relevant eReefs data provider node (DPN) descriptions, service implementations, related datasets and resolvable data endpoints by searching over harvested metadata and semantic definitions. The DBL is a key middleware component in the eReefs IP and provides the means for data providers to register their data services. This is implemented via simple RDF descriptions of both their data and the services that deliver them. The DBL also provides a RESTful API for client applications to query and filter relevant information about data provider nodes, their advertised service endpoints and their available datasets. However, specific details of the DBL implementation were not presented previously.
Providing middleware interfaces and data brokering, as the eReefs DBL does, adds considerably to the flexibility of the eReefs system as a whole. eReefs follows several other distributed environmental data systems in implementing a brokering approach, such as Australia's NEII (Bureau of Meteorology 2014) and the Global Earth Observation System of Systems (GEOSS) (Nativi et al. 2015). eReefs also adopts the general principles that Nativi et al. (2013) give regarding the brokering approach:
Autonomy: keep existing data infrastructures as autonomous as possible;
Subsidiarity: supplement existing infrastructure and governance arrangements with mediation;
Interconnection: connect existing infrastructure with tooling and approaches;
Low entry barrier: minimise the resources needed to participate;
Flexibility: accommodate existing and future systems and technologies; and
Effectiveness: deliver the information needs of the users.
In this paper, we present details of the current eReefs implementation of the DBL and its API. We also show how the various components, such as the water quality vocabularies and the Observable Properties ontology, and methodologies, such as the eReefs conventions for metadata annotation of netCDF headers, are leveraged by the DBL to enhance discovery of datasets and how this approach differs from other brokering approaches. In the next section we present the Data Provider Node concept and its implementation as a Web Ontology Language (OWL) ontology for describing and annotating the semantics around data providers, the data provider nodes they are responsible for, their respective datasets and their web-resolvable endpoints. The following section presents the eReefs DBL, its design, its implementation as a RESTful API and its use in client applications and front-ends. This is followed by a section where we present some implications of the DBL and then other related work. Finally, we conclude and consider future work in the final section.
DPN CONCEPT AND THE DPN ONTOLOGY
Environmental datasets can often exist in a multiplicity of representations. In a service-oriented architecture, datasets may be replicated across different service implementations via identical service interfaces, for example, a coverage dataset of chlorophyll-a concentration delivered via two instances of the Web Map Service (WMS), one implemented using GeoServer and the other using MapServer. In practice, due to efficiencies and preferences for certain implementations, datasets may also be delivered in a range of formats via different sets of service implementations and service interfaces, for example, the same coverage dataset of chlorophyll-a concentration may be delivered as netCDF via THREDDS (www.unidata.ucar.edu/software/thredds/current/tds/, last accessed 31 March 2015) and as JavaScript Object Notation (JSON) (www.json.org/, last accessed 31 March 2015) data via RESTful APIs. In both cases, the abstract notion of a single dataset persists beyond these concrete representations.
Metadata catalogs may be used to facilitate the registration and harvesting of web services metadata. Catalog Services for the Web (CSW) provides a standard interface for such catalogs (Senkler 2007). Implementations include GeoNetwork (http://geonetwork-opensource.org, last accessed 31 March 2015) and pyCSW (http://pycsw.org, last accessed 31 March 2015), and CSW extensions are available for the Comprehensive Knowledge Archive Network (CKAN) (http://ckan.org, last accessed 31 March 2015). However, there are currently limitations in using metadata catalog entries to enable the discovery and harvesting of data services and their capabilities. These catalogs rely on strongly constrained, data model-based methods, and current implementations are dataset-centric and limited in their ability to catalog data services as first-class objects.
On the other hand, RDF- and OWL-based ontologies can be used to describe metadata about datasets, data services and dataset relations in a form that allows this metadata to be published as Linked Data (Bizer et al. 2009). RDF provides a flexible platform for descriptions of resources, which are identified using Uniform Resource Identifiers (URIs) and linked internally and externally to form a graph of knowledge.
A number of ontologies have been developed for describing datasets and services (see Table 2). The service ontologies shown in Table 2 were designed for the use cases of service orchestration and choreography, which focus on representing the various bespoke service interfaces, inputs and outputs, formats for marshalling and unmarshalling objects and enterprise messaging automation. These include the Web Service Modeling Ontology (WSMO) (Roman et al. 2005), OWL-S (Martin 2004) and SAWSDL (Kopecký et al. 2007), and reflect the lineage of these ontologies from the web services domain of the Web Services Description Language (WSDL) and the Simple Object Access Protocol (SOAP). Other service ontologies provide lightweight descriptions of RESTful APIs, including the Web Application Description Language (WADL) (Hadley 2009) and hRESTS (Kopecký et al. 2008), as shown in Table 2. Verborgh et al. (2014) provide a more in-depth survey of approaches for semantic descriptions of REST APIs, comparing lightweight, logic-based and JSON-based approaches. Table 2 also presents a number of ontologies for describing datasets. In many cases, datasets can be subsets of larger datasets, for example a data cube consisting of a few dimensions taken from a multi-dimensional dataset. This is largely supported by the dataset ontologies shown in Table 2; however, the ontologies tend to differ in the way the dimensions of datasets are actually described. The RDF Data Cube (Cyganiak & Reynolds 2014) allows observed values to be defined natively as an RDF Dataset and allows the multi-dimensionality of the datasets to be represented easily. VoID (Alexander et al. 2011) defines a Dataset as ‘a set of RDF triples that are published, maintained or aggregated by a single provider’, capturing the governance of the data. However, most environmental datasets are provided in encodings other than RDF.
Data Catalog Vocabulary (DCAT) describes a dataset as ‘a collection of data, published or curated by a single agent, and available for access or download in one or more formats’ (W3C 2014). DCAT allows the dataset's metadata and its distribution means to be captured in RDF and does not natively represent the dataset itself using RDF. The DCAT Distribution class is used to allow various forms of a dataset to be defined, such as a download URL or an API.
Ontology | Dataset/service | Complexity | Main purpose |
---|---|---|---|
WSMO and WSMO-Lite | Service | Medium | Distinguishes four semantic aspects of services: function, behaviour, information model, and non-functional properties, which together form a basis for semantic automation and service orchestration |
OWL-S | Service | Medium | Supports representation of web service descriptions and their behaviour using RDF for service orchestration |
SAWSDL | Service | Medium | Enriches WSDL with semantic annotations in RDF for classifying, discovering, matching, composing, and invoking Web services |
hRESTS | Service | Low | Models Web API details relevant for invocation support |
SSWAP | Dataset/Service | Low | Uses RESTful architecture concepts and provides a simple OWL ontology for describing Providers, Resources, Graph, Subject, and Object. Allows for ontology reasoning to support semantic search and service matchmaking |
WADL | Service | Low | Provides XML-based descriptions of HTTP-based Web applications: the web resources available, their associations, methods and available data format MIME types |
VoID | Dataset | Low | Supports descriptions of RDF dataset metadata. Assumes a dataset is natively encoded as RDF. Allows linking between datasets via void:LinkSet |
RDF Data Cube | Dataset | Medium | Provides an ontology to describe datasets as a set of observed values organized along a group of dimensions, together with associated metadata. It is strongly aligned with statistical dataset use cases and is based on the SDMX approach for statistical data exchange |
DCAT | Dataset | Low | An RDF vocabulary for supporting interoperability between web accessible data catalogs. Defines concepts for Dataset, Distribution, Catalog and Catalog record, and reuses the FOAF concepts for Organisation and Person |
From the brief survey of the current dataset and service ontologies, there are some well-covered areas but also some gaps regarding representing datasets in ways that capture the services used to deliver them, governance aspects of multiple data services and dataset owners participating in a distributed and federated IP. In particular, the DCAT ontology appears to align with the notion of a conceptual dataset that is decoupled from the data format, catalog record and distribution means. DCAT does not, however, allow the data services aspect to be characterized, although one may extend the definition of Distribution to define a Web Service class.
The services ontologies detail technical aspects of web service orchestration such as messaging protocols and formats, yet there are a number of established data web services that are standardized but not ontology-based, some with well-known implementations. For example, the international spatial data standards body, the Open Geospatial Consortium (OGC), has services such as the Web Feature Service (WFS) (Vretanos 2010) with implementations such as GeoServer (http://geoserver.org/, last accessed 31 March 2015). The academic and research collaborative program Unidata (www.unidata.ucar.edu/about/, last accessed 31 March 2015) has THREDDS (www.unidata.ucar.edu/software/thredds/current/tds/, last accessed 31 March 2015). These data web services are well known, widely used and supported by many software libraries, but are not based on either the WS* stack or RESTful practices, which did not exist when they were originally developed, and so are incompatible with the service ontologies listed above. Therefore, the gap which a services ontology would fill is the ability to describe metadata about the available web services: service interfaces, implementations and their relevant web endpoints.
To our knowledge, there is currently no ontology that can relate multiple services to a notion of a conceptual dataset, provide social and governance metadata about such a dataset and also provide lightweight descriptions of known services.
The DPN ontology
The DPN-O contains classes for DPNs (institutions or sub-institutional groups) that contain Services, which have various Interfaces. The list of described service types can be extended, allowing IPs to deliver a growing range of services. The use of an ontology, implemented in RDF, allows the conceptual dataset entities to act as an index to which services are related. The RDF Individuals instantiated from the respective classes allow an identity (a URI) to be minted for each of them.
In DPN-O, the definitions of the dpn:Dataset and dpn:ServiceInterface classes are stubs, as they are expected to be substituted by classes from existing and well-accepted ontologies, such as DCAT for downloadable datasets and WSMO-Lite service interfaces for service orchestration. The design of the DPN-O supports descriptions of the relationships between DPNs and their respective services, datasets and organisations. This allows the representation of actual governance arrangements, datasets and their hosting arrangements, and technical deployments of data services.
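The relationships described above can be sketched with plain Python data classes. This is only an informal illustration: the class and attribute names loosely mirror the DPN, Service and Interface concepts, but they are assumptions and not the actual DPN-O terms, and all URIs are hypothetical `example.org` placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceInterface:
    interface_type: str  # illustrative label, e.g. "WMS"
    endpoint: str        # web-resolvable endpoint (hypothetical URL)


@dataclass
class Service:
    uri: str
    implementation: str  # e.g. "THREDDS"
    dataset_uris: list = field(default_factory=list)
    interfaces: list = field(default_factory=list)


@dataclass
class DataProviderNode:
    uri: str
    organisation: str
    services: list = field(default_factory=list)

    def services_for(self, dataset_uri):
        """Use the conceptual dataset URI as an index to the services delivering it."""
        return [s for s in self.services if dataset_uri in s.dataset_uris]


CHL_A = "http://example.org/dataset/chlorophyll-a"

node = DataProviderNode(
    uri="http://example.org/dpn/agency-x",
    organisation="Agency X",
    services=[
        Service(
            uri="http://example.org/service/thredds",
            implementation="THREDDS",
            dataset_uris=[CHL_A],
            interfaces=[ServiceInterface("WMS", "http://example.org/thredds/wms")],
        ),
        Service(
            uri="http://example.org/service/rest",
            implementation="REST API",
            dataset_uris=[CHL_A],
            interfaces=[ServiceInterface("JSON", "http://example.org/api/chl-a")],
        ),
    ],
)

matching = node.services_for(CHL_A)
```

Here a single conceptual dataset is delivered by two services, which matches the situation described earlier where the same coverage is available both as netCDF and as JSON.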
EREEFS DBL
The eReefs DBL allows mediation between client applications, APIs and widgets on one side and the services and datasets to which DPNs provide access on the other. A Linked Data approach is implemented where people and automated agents can use a ‘Follow-Your-Nose’ method to incrementally discover more data and metadata about things by following typed links from a starting point. HTTP, HTML and RDF are the mechanisms and data formalisms used for web link resolution and machine readability. URIs provide identity to elements in the system (datasets, observed properties, services and data providers) and the Hypertext Transfer Protocol (HTTP) provides the protocol for access to them. Ontologies, implemented in RDF, provide semantics about those identified elements in a web-compatible form to allow clients to traverse knowledge domains.
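The ‘Follow-Your-Nose’ idea can be sketched as a breadth-first walk over typed links. In the example below an in-memory dictionary stands in for dereferencing URIs over HTTP; the URIs, the link type names and the `follow_your_nose` function are all hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical stand-in for resolvable Linked Data: each URI maps to the
# (link type, target URI) pairs that dereferencing it would return.
LINKS = {
    "http://example.org/dpn/1": [
        ("hasService", "http://example.org/service/thredds"),
    ],
    "http://example.org/service/thredds": [
        ("hosts", "http://example.org/dataset/chl-a"),
    ],
    "http://example.org/dataset/chl-a": [
        ("observedProperty", "http://example.org/def/chlorophyll"),
    ],
    "http://example.org/def/chlorophyll": [],
}


def follow_your_nose(start, resolve, max_depth=3):
    """Walk typed links breadth-first and return (source, link type, target) triples."""
    seen = {start}
    queue = deque([(start, 0)])
    triples = []
    while queue:
        uri, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand links beyond the requested depth
        for link_type, target in resolve(uri):
            triples.append((uri, link_type, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return triples


triples = follow_your_nose("http://example.org/dpn/1", lambda u: LINKS.get(u, []))
```

Passing the resolver as a function keeps the traversal independent of how links are obtained, so the dictionary lookup could be swapped for an HTTP request that parses RDF or JSON-LD.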
The DBL's RESTful API facilitates data and service discovery by client applications. It includes methods which align with the DPN ontology core elements as well as search functionality. The API's responses are encoded either in JSON-LD, a JSON-based format for Linked Data, or in a simpler JSON serialisation for less verbose data descriptions (see Table 3). This decouples the client applications from the services and allows more flexible data integration into the applications.
ID | REST API endpoint | HTTP method | Parameters | Response content type |
---|---|---|---|---|
1 | /dpn/ | GET | N/A | application/json |
2 | /catalog/ | GET | N/A | application/json |
3 | /service/ | GET | N/A | application/json |
 | /service/?_format=json-ld | GET | _format | application/ld+json |
4 | /dataset/ | GET | N/A | application/json |
5 | /search | GET | term or uri, offset, limit, prefLabel | application/json |
The eReefs DBL API instance endpoints given in Table 3 are available via the eReefs DBL URL: http://ereefs.org.au/dbl. The /dpn endpoint gives a list of DPNs registered with the DBL, which users can query further to get lists of services, datasets etc. The /catalog endpoint provides a list of particular implementations of services, for example the THREDDS data service catalog (refer to http://ereefs.org.au/dbl/catalog).
The /service/ endpoint acts as a register of the DBL's services. HTTP query string arguments may be added to the /service/ URI to filter results. The responses provide a view of the semantics of the DPN and its available services, hosted datasets and service endpoints. Similarly, the /dataset/ endpoint returns metadata on the available datasets.
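A client could assemble request URLs for the register endpoints of Table 3 along the following lines. The base URL and paths come from the text; the helper names `dbl_url` and `expected_content_type` are assumptions for this sketch, and no request is actually sent.

```python
from urllib.parse import urlencode

DBL_BASE = "http://ereefs.org.au/dbl"  # base URL quoted in the text

# Register endpoints as listed in Table 3.
ENDPOINTS = {
    "dpn": "/dpn/",
    "catalog": "/catalog/",
    "service": "/service/",
    "dataset": "/dataset/",
}


def dbl_url(name, **params):
    """Build the request URL for a register endpoint, with optional query arguments."""
    url = DBL_BASE + ENDPOINTS[name]
    if params:
        url += "?" + urlencode(params)
    return url


def expected_content_type(params):
    """Content type per Table 3: JSON-LD only when _format=json-ld is requested."""
    if params.get("_format") == "json-ld":
        return "application/ld+json"
    return "application/json"


plain = dbl_url("service")
linked = dbl_url("service", _format="json-ld")
```

A real client would pass these URLs to an HTTP library and parse the body as JSON; that step is omitted because the response structure is not specified here.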
Dataset classification
A key aspect of integrating data into the eReefs IP is adopting the eReefs conventions for annotating data with specific classification metadata using the methodology given in Yu et al. (2014). This includes metadata about the substance or taxon of interest, scaled quantity kinds, units of measure and the medium that the observations forming the data relate to. The choice of these metadata concepts is informed by the Observations and Measurements standard (Cox 2011) and formalized in the Observed Properties ontology (Cox et al. 2014a). Specific values for the various concepts are taken from vocabularies relevant to a particular IP. Figures 6 and 7 both show the eReefs DBL instance using a ‘Water Quality’ domain vocabulary for instances of concepts such as substance or taxon of interest. Navigable URIs uniquely identify concepts in the vocabulary and form the basis of the linked data approach to eReefs metadata.
For each data provider, a process for binding vocabulary URIs to data is required. A business analysis process takes place to establish the association between vocabulary concepts and data provider data. Typically, this occurs as a series of conversations between an eReefs vocabulary expert and a data provider administrator who has knowledge of the relevant science domain. The outcome of this process is an informal document describing data mappings between URIs, which uniquely identify vocabulary concepts, and data variables.
By associating defined concepts with defined relationships and data, the eReefs infrastructure can effectively index data across many different DPNs and present users with many axes of classification on which to find data and judge whether a DPN's data is fit for particular tasks.
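One way to picture this kind of indexing is an inverted index keyed by classification axis and concept URI. The sketch below is an assumption about how such an index could work, not a description of the DBL's internals; the dataset URIs, concept URIs and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical harvested records: dataset URI -> classification annotations.
HARVESTED = {
    "http://example.org/dpn-a/dataset/1": {
        "substanceOrTaxon": "http://example.org/def/chlorophyll",
        "medium": "http://example.org/def/ocean",
    },
    "http://example.org/dpn-b/dataset/7": {
        "substanceOrTaxon": "http://example.org/def/chlorophyll",
        "medium": "http://example.org/def/river",
    },
    "http://example.org/dpn-b/dataset/8": {
        "substanceOrTaxon": "http://example.org/def/solids",
        "medium": "http://example.org/def/ocean",
    },
}


def build_index(records):
    """Map each (axis, concept URI) pair to the set of datasets annotated with it."""
    index = defaultdict(set)
    for dataset, annotations in records.items():
        for axis, concept in annotations.items():
            index[(axis, concept)].add(dataset)
    return index


def find(index, **criteria):
    """Return the datasets that satisfy every axis=concept criterion."""
    matches = [index.get((axis, concept), set()) for axis, concept in criteria.items()]
    return set.intersection(*matches) if matches else set()


index = build_index(HARVESTED)
chlorophyll_anywhere = find(index, substanceOrTaxon="http://example.org/def/chlorophyll")
chlorophyll_in_ocean = find(
    index,
    substanceOrTaxon="http://example.org/def/chlorophyll",
    medium="http://example.org/def/ocean",
)
```

Because the keys are concept URIs rather than free-text labels, datasets from different DPNs that use different variable names still land in the same bucket once they are bound to the same vocabulary term.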
The DBL /search endpoint allows clients to query for data against declared semantics harvested from the dataset metadata, which cross-references domain vocabulary terms. The search endpoint accepts either keywords or the specific URI for a term and returns datasets using that term or URI. For example, the dataset shown in Figure 7, ‘chlorophyll-a concentration’, would be found by using the search endpoint with a keyword thus: http://ereefs.org.au/dbl/search?term=chlorophyll or with a URI thus: http://ereefs.org.au/dbl/search?uri=http://environment.data.gov.au/def/object/chlorophyll. Other parameters allow paging of results and other delivery conveniences.
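The two query forms can be generated with a small helper such as the one below. The parameter names `term`, `uri`, `offset` and `limit` are taken from Table 3; the function name `search_url`, the rule that exactly one of `term` or `uri` is supplied, and the choice to leave `:` and `/` unescaped are assumptions made for this sketch.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://ereefs.org.au/dbl/search"  # search endpoint quoted in the text


def search_url(term=None, uri=None, offset=None, limit=None):
    """Build a /search URL from a keyword or a vocabulary URI, with optional paging."""
    if (term is None) == (uri is None):
        raise ValueError("supply exactly one of term or uri")
    params = {"term": term} if term is not None else {"uri": uri}
    if offset is not None:
        params["offset"] = offset
    if limit is not None:
        params["limit"] = limit
    # Keep ':' and '/' unescaped so the result matches the URL form used in the text.
    return SEARCH_URL + "?" + urlencode(params, safe=":/")


by_keyword = search_url(term="chlorophyll")
by_uri = search_url(uri="http://environment.data.gov.au/def/object/chlorophyll")
second_page = search_url(term="chlorophyll", offset=10, limit=10)
```

Searching by URI avoids the ambiguity of keywords, since the URI identifies exactly one vocabulary concept.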
The DBL API presented in this paper is simple but provides a component for establishing a registry system for data providers and their hosted services and datasets. The DPN-O descriptions provide the necessary metadata about the DPNs and their services and datasets. Registering these descriptions with a DBL instance allows discovery of DPNs, services and datasets and facilitates querying and processing of content in a flexible way while preserving relevant contextual information.
The eReefs Viz Portal client demonstrator
Other possible clients for the DBL include map visualization widgets, faceted data search, and automated generation of environmental report cards. Without a mediating layer to the data services and datasets, each client would require software libraries to handle binding and access to the data individually. The DBL provides an API to allow these clients to be developed in a light-weight fashion where the functionality of data query and access is already taken care of.
DISCUSSION
The DBL provides a mechanism for registering DPN-O instances and enables rich data and services discovery. In this paper, we have presented an implementation of the DBL in which DPNs use well-known data web services such as THREDDS and the DBL harvests relevant metadata from them. A cache of DPN metadata is built and indexed within the DBL, which allows powerful querying, filtering and access to data across potentially very heterogeneous data sources which may yet be compared on many axes of classification. As the DBL is a web service API, many client applications and user interfaces can be built on it to meet a wide range of applications such as map visualisations, faceted data search and automated generation of environmental report cards. The DBL also allows a client application to be decoupled from the data services themselves, which allows for dynamic discovery of datasets, e.g. a client application that binds its data queries via a DBL for gridded ‘chlorophyll concentration’ datasets will receive all related datasets via a vocabulary-based search across all registered DPNs. As new DPNs are added and registered with a DBL, no added work is required for such client applications to receive updated datasets.
The current process for defining a DPN-O instance and binding vocabulary definitions for a given DPN is performed manually on behalf of the data providers. The intention of the current phase of eReefs was to demonstrate its application for data brokering and integration. The cost of implementing the set of DPN-O instances for the data providers is currently small as there are five DPNs, but the manual approach will not be efficient as an increasing number of DPNs are integrated.
The data brokering approach presented in this paper, which uses the DBL, the DPN-O and the respective environmental vocabulary definitions, provides a number of benefits for the eReefs project. It provides the project with added flexibility and extensibility in being able to develop multiple client applications and in registering additional or deregistering data services on-demand as it decouples the binding of data assets from the client applications.
Despite the benefits of the DBL and the brokering architecture, there is a cost in specifying both the DPN-O instance and the bindings to the vocabulary definitions. Therefore, tools to automate the generation of DPN-O instances are required so that DPNs can be registered on demand, e.g. web-based user interfaces as well as REST Web APIs. Similarly, additional tooling is required to guide users in selecting appropriate vocabularies and binding vocabulary definitions when configuring netCDF headers. An evaluation of these tools and their effectiveness in reliably and simply developing the DPN-O instances and mappings is also required. These tools are being developed as items of future work. Since the DBL indexes service endpoints, it can be used as the access point for service conformance testing (both structural and performance) in order to ensure services meet certain IP expectations or contracts. Currently, test frameworks bind to static definitions of service endpoints and test classes are hard-coded for each type of service implementation. The ability to define tests based on service types and on the outputs expected from inputs queried via the DBL would allow a flexible way of testing conformance for the DPN services. Further work is required to develop a test framework which leverages the DBL.
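A test framework of this kind might look roughly like the sketch below, where checks are registered per service type and applied to whatever service records a DBL query returns. Everything here is hypothetical: the record shape, the `check_services` function, the two simple structural checks and the two-second threshold are illustrative assumptions, and a fake fetch function replaces real HTTP calls so the example runs offline.

```python
import time

# Hypothetical service records, as a client might assemble them from a DBL listing.
SERVICES = [
    {"endpoint": "http://example.org/thredds/catalog.xml", "type": "THREDDS"},
    {"endpoint": "http://example.org/wms?request=GetCapabilities", "type": "WMS"},
]

# One structural check per service type instead of one test class per endpoint.
CHECKS = {
    "THREDDS": lambda body: "<catalog" in body,
    "WMS": lambda body: "WMS_Capabilities" in body,
}


def check_services(services, fetch, checks, max_seconds=2.0):
    """Fetch each endpoint, then record structural and timeliness results."""
    report = {}
    for service in services:
        started = time.monotonic()
        body = fetch(service["endpoint"])
        elapsed = time.monotonic() - started
        report[service["endpoint"]] = {
            "structural": checks[service["type"]](body),
            "timely": elapsed <= max_seconds,
        }
    return report


def fake_fetch(url):
    """Stand-in for an HTTP GET: the THREDDS endpoint answers, the WMS one is broken."""
    if "thredds" in url:
        return "<catalog name='demo'/>"
    return "<html>error</html>"


report = check_services(SERVICES, fake_fetch, CHECKS)
```

Because the list of services is an input rather than a hard-coded constant, newly registered DPN services would be covered without changes to the test code.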
Despite the limitations of current metadata catalog implementations, e.g. Catalog Service for the Web, CKAN and ERDDAP, they provide functionality to discover and harvest metadata describing datasets. Thus, additional work is required to interface with them so that metadata from these catalog implementations can be harvested and handled by the DBL.
RELATED WORK
A number of prior and current efforts have sought to develop integrated data platforms, especially in the environmental domain. The DataOne (Allard 2012) initiative has developed an integrated data platform bringing together multiple disciplines and organisations to provide tooling and data to support biological, ecological, and environmental science and research. The DataOne platform provides a means for sharing data, tooling and findings. DataOne's scope is thus much broader than the eReefs use cases and includes collaborative tool development and wider stakeholder engagement. DataOne does recognize the need for a nodal approach with the concept of a DataOne Member Node, which may include existing or new repositories. The work presented in this paper provides an ontology for describing such nodes as DPNs, tooling for precise semantic definitions of the datasets, and support for service endpoint discovery in a lightweight fashion.
Catalog services have been recognized as useful components in integrated data platforms. The USGS Geo Data Portal (Blodgett et al. 2012) is a platform for brokering access to geospatial vector and gridded datasets via data services, using catalog metadata to help users interpret and select aggregated datasets via a portal. In contrast to the USGS Geo Data Portal, the approach in the eReefs project is to enable richer dataset and service metadata capture using semantic web technologies, so that the utility of metadata extends beyond human users and machines can also facilitate richer data discovery.
The EVOp project (Vitolo et al. 2015) seeks to also establish an integrated data platform and a portal and has focused on the ability to allow model selection for datasets (e.g. between TOPMODEL and FUSE) as well as extracting data from a wide range of data sources via ad-hoc methods (e.g. web-scraping) and service-based real-time data endpoints. The authors recognize that ‘self-describing data formats would be a better solution to store and transfer environmental data as they could integrate metadata information and standardised definitions of domain-specific variables and uncertainties’, which is the approach the eReefs project has taken.
The DPN-O introduced in this paper defines the dpn:Dataset, dpn:Service and dpn:ServiceInterface classes as stubs, which provides the option of further specialising them with definitions from existing and well-accepted ontologies. The reason for this is to ensure minimal ontological commitment and to give greater opportunity for reuse of the DPN-O than if implementation details were defined. It also allows multiple formalisms to be used with the DPN-O as required in different contexts and implementations, so the design accommodates a wide range of metadata formalisms. Relevant ISO standards may be used as specialisations of these classes, namely ISO 19119 for describing service metadata and ISO 19115 for describing metadata for geographic datasets. These ISO standards have been developed under the governance of the Technical Committee ISO/TC 211 for Geographic Information. Both are specified as platform-neutral Unified Modeling Language (UML) models. The current practice is to map these to XML-based implementations; however, mapping the conceptual models to OWL/RDF-based implementations would enable these standards to be used as options to align or specialise the above DPN-O classes. OWL-S has been used in previous work to align the ISO standards with OWL/RDF ontologies (Yue et al. 2007); however, as noted above under the section ‘DPN concept and the DPN ontology’, in some cases OWL-S is a more heavyweight approach than the proposed approach of using DPN-O. The DPN-O provides a lightweight alternative to OWL-S by requiring only a small amount of information about a service and a dataset for them to be discoverable and useful to clients. The DPN-O can also be specialised further to describe services and datasets in much more detail if necessary, for example, with the full suite of the above service and dataset metadata standards.
Future work is required to demonstrate the use of such service and dataset ontologies and models, e.g. ISO 19115 and ISO 19119, for encoding metadata about relevant services and datasets using the DPN-O.
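The effect of keeping the DPN-O classes as stubs can be sketched in a few lines of Python. This is an illustration under stated assumptions and not part of the DPN-O itself: triples are held as plain tuples rather than in an RDF store, and every `ex:` name, including the specialised class `ex:GeoDataset`, is hypothetical. Only rdf:type, rdfs:subClassOf and the dpn: class names come from existing vocabularies or from this paper.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# A toy metadata graph. ex:GeoDataset is a hypothetical specialisation of the
# dpn:Dataset stub, e.g. one that would carry ISO 19115-style detail.
triples = {
    ("ex:GeoDataset", SUBCLASS_OF, "dpn:Dataset"),
    ("ex:reefTemperature", RDF_TYPE, "ex:GeoDataset"),
    ("ex:chlorophyll", RDF_TYPE, "dpn:Dataset"),
    ("ex:dataServer", RDF_TYPE, "dpn:Service"),
}


def subclasses_of(cls, graph):
    """Return cls together with every class that is transitively a subclass of it."""
    found = {cls}
    frontier = [cls]
    while frontier:
        parent = frontier.pop()
        for s, p, o in graph:
            if p == SUBCLASS_OF and o == parent and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def instances_of(cls, graph):
    """Return the sorted instances of cls, including instances of its subclasses."""
    classes = subclasses_of(cls, graph)
    return sorted(s for s, p, o in graph if p == RDF_TYPE and o in classes)


print(instances_of("dpn:Dataset", triples))
```

A client that knows only the dpn:Dataset stub still finds both datasets, while a client that understands the specialised class can query for `ex:GeoDataset` alone and make use of the extra detail.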
eReefs shares the aim of the GEOSS project (Nativi et al. 2015) of reducing the effort required of data contributors by brokering access to service-delivered data. Unlike GEOSS’ Ranking Algorithm approach presented in Nativi et al. (2015), the eReefs DBL separates a dataset's domain metadata (metadata of the form usually encountered in data catalogs) from its service metadata, which indicates how the dataset is made available. This allows discovery of data independently of data access and does not impose a judgement call on data access via a rank. It does have the overhead of requiring service type (or service interface) definitions with type vocabularies, as per the DPN-O. When service definitions and description vocabulary terms are used, dataset access can be related unambiguously to users, rather than proxied with a rank. This more open-ended approach was taken to ensure that the evolution of dataset access could be catered for and to give choice to potential dataset users. The rise in popularity of non-standardised data services such as InfluxDB (http://influxdb.com/) for timeseries data in the commercial domain, in place of standards such as the Sensor Observation Service, makes it clear that, if possible, new or less well-established services need to be catered for in the environmental domain. Our brokering approach only requires semantic descriptions of a service's endpoints for this to occur and will allow users to judge the service's utility.
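The separation between domain metadata and service metadata described above can be sketched as follows. This is a minimal illustration and not the DBL API: the class names, field names, function names and example URLs are all hypothetical. Discovery filters on domain metadata only, and access options are then listed by service type without any ranking.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ServiceDescription:
    """Service metadata: how a dataset is made available."""
    service_type: str  # a term from a service type vocabulary (assumed)
    endpoint: str


@dataclass
class DatasetDescription:
    """Domain metadata of the kind usually found in a data catalog."""
    title: str
    observed_properties: List[str]
    services: List[ServiceDescription] = field(default_factory=list)


def discover(datasets: List[DatasetDescription],
             observed_property: str) -> List[DatasetDescription]:
    """Discovery consults domain metadata only, never service metadata."""
    return [d for d in datasets if observed_property in d.observed_properties]


def access_options(dataset: DatasetDescription) -> List[Tuple[str, str]]:
    """List every access route in registration order, with no rank applied."""
    return [(s.service_type, s.endpoint) for s in dataset.services]


catalog = [
    DatasetDescription(
        title="Reef water quality",
        observed_properties=["chlorophyll-a concentration"],
        services=[
            ServiceDescription("SOS", "http://example.org/sos"),
            ServiceDescription("InfluxDB", "http://example.org/influx"),
        ],
    ),
    DatasetDescription(title="River flow", observed_properties=["discharge"]),
]

for dataset in discover(catalog, "chlorophyll-a concentration"):
    print(dataset.title, access_options(dataset))
```

Because both routes are reported with their service types, the choice between a standardised and a non-standardised service is left to the user or client application.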
CONCLUSION AND FUTURE WORK
In this paper we have presented the DPN concepts and ontology formulation for describing data owners, datasets and data services. We have also presented the DBL, which provides a general purpose API for querying, filtering and facilitating access to sets of DPNs and their data holdings by harvesting their metadata and linking it to definitional datasets such as domain vocabularies. We have presented an instance of the DBL as a key middleware component in the eReefs IP. It provides the means for registering metadata about a data provider, its set of data services as a DPN, and its datasets via a simple RDF description using the DPN ontology. We discussed how the various components, such as the water quality vocabularies, the Observable Properties ontology and the eReefs conventions for data annotation, are leveraged by the DBL to enhance discovery of datasets. We also provided example uses of the DBL in a prototype client application called the eReefs Viz Portal.
This paper has identified gaps in the existing dataset and service ontologies and has presented an ontology developed to capture the DPN concept and the idea of a conceptual dataset and related services. The DPN-O description allows semantic representation of organizations, groups, DPNs, datasets, service types, service implementations and their web endpoints using semantic web technologies. The data delivered via DPNs can be annotated with links to domain concepts, for example, water quality observable properties like chlorophyll-a concentration. DPN-O improves on existing ontologies by proposing simple constructs for the use cases outlined in this paper. DPN-O descriptions are lightweight and extensible, easily allowing any number of existing and new service types to be linked to a DPN's datasets. Since semantic web technologies adopt the open-world assumption, descriptions can be simple or highly detailed in line with DPN owners’ wishes and expectations of metadata utility.
Further work is required to develop tools to automate the process of creating DPN-O instances as well as binding vocabulary definitions to the respective data services. Also, future work will be undertaken to develop additional client applications to meet other eReefs use cases including mobile applications and generating reports as well as conformance checking of data assets registered in eReefs through the DBL.
ACKNOWLEDGEMENTS
This research was funded through the eReefs research project by CSIRO, the Science and Industry Endowment Fund (SIEF) and the Great Barrier Reef Foundation (GBRF).