The IRIDA platform is implementing ontologies as a means of integrating the different data types required for outbreak investigation, surveillance and reporting. An ontology defines the concepts and categories in a subject domain in order to describe their properties and the relations between them. Ontology-based software makes data both human- and machine-readable because terms are standardized and reused between information types. IRIDA’s Genomic Epidemiology Application Ontology (GenEpiO) is another innovation that will enhance the platform’s analytical power.
The Need for Data Harmonization
Timeliness of infectious disease analyses is key for reducing the number of preventable cases of disease. The ability to resolve outbreaks relies heavily on good contextual information regarding “person, place and time”, which is crucial for identifying sources of contamination and exposure. Contextual information is also required for human health risk assessments, source attribution, ecosystems modelling, and in the simplest terms, to make sense of the genome data.
The digitization of genomics allows for increased resolution of infectious sequence types and rapid transmission of data. However, significant computational challenges remain in genomics result reporting and analysis. Raw genome sequences need to be processed and then presented in different ways, in a timely and secure manner, to end-users in the health care environment who have vastly different roles (attending physicians, infection control, environmental health officers, medical health officers, public health epidemiologists, etc.) and affiliations. The ability to share secure and standardized data within and across organizations is critical for implementing genomic epidemiology in public health microbiology.
Sequence data and digitized contextual information are known as digital assets – that is, they can be used for many different purposes and investigations. Best data stewardship practices state that digital assets like contextual information should be stored in a way that is FAIR (Findable, Accessible, Interoperable, Reusable) to maximize value and best prepare the data for future applications.
Significant challenges for public health and infectious disease data integration are posed by the lack of standardization. Contextual information is often recorded using free text or incompatible data dictionaries. During an outbreak, information from different sources must quickly be harmonized and combined in order to identify the source of a pathogen and its routes of transmission – especially when outbreak investigations extend beyond agencies and borders. Manual recoding and integration of data can take hours, days or even weeks to complete. These challenges impact computability for fast analyses, affecting time-to-response.
Using standardized terms, or mapping institution-specific fields and terms to a controlled vocabulary, better enables software systems to communicate and facilitate data integration and exchange.
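The mapping step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not IRIDA's or GenEpiO's implementation: the `EX:` identifiers, the labels, the synonym table and the function names are all hypothetical and chosen only to show the idea.

```python
# Hypothetical controlled vocabulary: identifier -> standard label.
# The "EX:" identifiers are placeholders, not real ontology IDs.
CONTROLLED_VOCABULARY = {
    "EX:0000001": "chicken meat",
    "EX:0000002": "ground beef",
    "EX:0000003": "leafy greens",
}

# Hypothetical institution-specific terms mapped to the standard identifiers.
SYNONYMS = {
    "chicken": "EX:0000001",
    "poultry meat": "EX:0000001",
    "hamburger": "EX:0000002",
    "minced beef": "EX:0000002",
    "lettuce": "EX:0000003",
}


def normalize(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def map_term(free_text):
    """Return the standard identifier for a free-text value, or None."""
    key = normalize(free_text)
    for term_id, label in CONTROLLED_VOCABULARY.items():
        if label == key:
            return term_id
    return SYNONYMS.get(key)


def harmonize(records, field):
    """Split records into those that map to the vocabulary and those that do not."""
    mapped, unmapped = [], []
    for record in records:
        term_id = map_term(record[field])
        if term_id is None:
            unmapped.append(record)  # left for manual curation
        else:
            mapped.append({
                **record,
                field + "_id": term_id,
                field + "_label": CONTROLLED_VOCABULARY[term_id],
            })
    return mapped, unmapped


# Two agencies record the same exposure with different local terms.
agency_a = [{"case": "A1", "food": "Hamburger"}]
agency_b = [{"case": "B7", "food": " minced  beef "},
            {"case": "B8", "food": "mystery stew"}]
mapped, unmapped = harmonize(agency_a + agency_b, "food")
```

Once both agencies' records carry the same identifier, they can be combined and compared without manual recoding; values that do not map are set aside for a curator rather than silently guessed.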
Ontologies as a Framework for Data Integration
One way to provide a framework for integrating clinical, epidemiological and laboratory (genomic) data types is to use ontologies. Ontologies are well-defined, standardized vocabularies interconnected by logical relationships, constructed to facilitate fast and automated querying.
Ontologies, simply put, are computer files which organize things into classes of terms, and link those classes together in different ways. Ontology files can be implemented in different spreadsheets, applications and platforms according to the needs of their users. Standardizing vocabulary increases interoperability between systems, allows previously isolated databases to be integrated, and resolves semantic ambiguity. Highlights of the benefits of ontologies for surveillance and detection activities include:
- Faster data integration and exchange based on standardized fields. The longitudinal nature of pathogen surveillance requires information to be propagated and compared between agencies, which can occur much more quickly and in a computer-amenable manner if contextual information is standardized.
- Mapping of institution-specific terms used in public health interfaces to standards allows for customized data entry while facilitating interoperability.
- Standardized quality control and result reporting trigger actionable events in the same way, which will contribute to the accreditation and validation of clinically implemented genomics pipelines.
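To make the idea of classes linked by logical relationships more concrete, the following Python sketch stores a tiny hierarchy as (subject, relation, object) triples and uses it to answer a query automatically. It is only an illustration: the `EX:` terms, the case data and the function names are hypothetical, and real ontologies are typically stored in dedicated formats and queried with dedicated tools.

```python
from collections import defaultdict

# Hypothetical terms linked by "is_a" relationships: (subject, relation, object).
TRIPLES = [
    ("EX:poultry", "is_a", "EX:meat"),
    ("EX:chicken_breast", "is_a", "EX:poultry"),
    ("EX:ground_beef", "is_a", "EX:meat"),
    ("EX:meat", "is_a", "EX:food_product"),
    ("EX:lettuce", "is_a", "EX:food_product"),
]


def build_children_index(triples, relation="is_a"):
    """Index each parent term to the set of its direct children."""
    children = defaultdict(set)
    for subject, rel, obj in triples:
        if rel == relation:
            children[obj].add(subject)
    return children


def descendants(term, children):
    """Return every term below `term` in the hierarchy (excluding itself)."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def cases_exposed_to(term, case_exposures, children):
    """List cases whose recorded exposure is `term` or any more specific term."""
    wanted = descendants(term, children) | {term}
    return sorted(case for case, exposure in case_exposures.items()
                  if exposure in wanted)


children = build_children_index(TRIPLES)
# Hypothetical case records, each annotated with its most specific exposure term.
case_exposures = {
    "case1": "EX:chicken_breast",
    "case2": "EX:lettuce",
    "case3": "EX:ground_beef",
}
meat_cases = cases_exposed_to("EX:meat", case_exposures, children)
```

Because the hierarchy is explicit, a query for the general term "meat" also retrieves cases recorded under more specific terms, which free-text fields cannot offer without manual review.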
The Open Biomedical Ontologies (OBO) Foundry
The intended uses of an ontology can influence the way it is constructed, and the architecture of an ontology can significantly affect how it interacts with other ontologies, sometimes resulting in incompatibility. The OBO Foundry is a community of scientists committed to creating interoperable biomedical ontologies through collaborative development. The principles and practices of the OBO Foundry (e.g. common architecture, multiple users to increase usability, the use of IDs to disambiguate terms and their meanings) have produced >150 interoperable ontologies that describe many different domains of knowledge, e.g. the Gene Ontology (GO).
IRIDA’s Genomic Epidemiology Application Ontology (GenEpiO)
Our research efforts include the development of a Genomic Epidemiology Application Ontology (GenEpiO), based on public-health stakeholder interviews and the harmonization of important laboratory, clinical and epidemiological resources. The goal is to develop an ontology that supports an end-to-end genomic epidemiology pipeline, so that all of the contextual information required to interpret genomics data is propagated from the point of intake through sequencing to end use (e.g. in an epidemiologic investigation).
Since diseases do not respect international borders, uptake of a common, standard vocabulary for describing outbreak and surveillance activities is crucial for inter-jurisdictional interpretation of results and data sharing.
GenEpiO has been built according to the principles and practices of the OBO Foundry, and aggregates pertinent terminology from a number of existing OBO Foundry ontologies. GenEpiO contains >4000 key fields and terms to describe sample metadata, lab analytics, clinical information, exposures and epidemiological data. GenEpiO incorporates fields from community standards, e.g. NCBI BioSample and the MIxS minimum information checklist, as well as from existing ontologies, to ensure accuracy of meaning and facilitate interoperability between software systems.
The Genomic Epidemiology Consortium
Harmonization of the genomic epidemiology ontology can only be achieved by consensus and wide adoption, and international input and expertise are crucial to achieving these goals. In order to ensure that GenEpiO is sufficiently robust to serve all use cases, we have formed an inclusive International Genomic Epidemiology Ontology Consortium to build partnerships and solicit domain expertise. GenEpiO has been developed in collaboration with this consortium, which has >80 members from 15 different countries. The consortium includes leaders from different health, regulatory, academic and standards communities, and representatives from different sectors. All interested individuals are welcome to participate. More information regarding GenEpiO’s design, how to contribute new terms, and our goals and activities can be found at www.genepio.org. To join or to find out more, please contact email@example.com.
In addition, other key ontology domains under development include Antimicrobial Resistance (ARO), Pathogen Surveillance Ontology (SurvO), and the Mobile Elements Ontology (MobiO), all critical for tackling the global threats of antibiotic resistance and emerging pathogens. As good food descriptors for food products and food production environments are key for surveillance and foodborne outbreak investigations, we have also created the Food Ontology (FoodOn) to hold this content, and have led the formation of the FoodOn Consortium to support its use in various academic, public health, and industry contexts.
Community contributions welcome.
Ontology Tool Development
We are also developing tools to better enable users to interact with our ontologies, such as the Genomic Epidemiology Entity Mart (GEEM). GEEM enables software developers to shop for fields and terms appropriate to the needs of their users in order to create data specifications, which can then be used to build ontology-driven interfaces and applications. GEEM features browse and search, shopping cart, and discussion tab (for ontology curators) functionalities.
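As a rough illustration of how a data specification might drive an application, the Python sketch below checks submitted records against a small specification of selected fields. The specification layout, the field names, the `EX:` identifiers and the `validate` function are all assumptions made for this example; they do not represent GEEM's actual output format or API.

```python
# Hypothetical specification of selected fields. This is NOT GEEM's real
# format; it only illustrates what a data specification could contain.
SPECIFICATION = {
    "collection_date": {"term_id": "EX:0000101", "required": True, "choices": None},
    "specimen_type": {"term_id": "EX:0000102", "required": True,
                      "choices": {"stool", "blood", "food swab"}},
    "travel_history": {"term_id": "EX:0000103", "required": False, "choices": None},
}


def validate(record, specification):
    """Return a list of problems found in `record`; an empty list means it conforms."""
    problems = []
    for field, rules in specification.items():
        value = record.get(field)
        if value is None or value == "":
            if rules["required"]:
                problems.append(f"{field}: missing required value")
            continue
        # Fields with a pick-list only accept terms from that list.
        if rules["choices"] is not None and value not in rules["choices"]:
            problems.append(f"{field}: '{value}' is not an allowed term")
    # Flag fields the specification does not define.
    for field in record:
        if field not in specification:
            problems.append(f"{field}: not part of the specification")
    return problems


good = validate({"collection_date": "2017-05-01", "specimen_type": "stool"},
                SPECIFICATION)
bad = validate({"specimen_type": "saliva", "notes": "free text"}, SPECIFICATION)
```

An interface built on such a specification could use the same structure to generate form fields and pick-lists, so that data entry and validation share a single source of terms.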
For more information regarding GEEM, as well as other text parsing and text matching tools under development, please contact firstname.lastname@example.org.