Introduction

GBOL is a genome annotation ontology based on semantic web technologies. It provides the means to consistently describe automated genome annotations typically found in the GenBank, EMBL and GFF formats. Additionally, it can describe the linked data provenance of the abstraction process of genetic information from genome sequences.

An overview of the structure of the GBOL ontology is shown in Figure “GBOL structure”. The essential design principle of GBOL is that sequences have features, which in turn have genomic locations on the sequence. These relationships are associated with provenance that captures both the statistical basis of each individual annotation (element-wise provenance) and the programs and parameters used for the complete set (dataset-wise provenance). All annotations for a given sequence can be packed into a single entity, a document.

GBOL structure

Figure “GBOL structure”: Network-based view of the core of the GBOL ontology, generated using RDF2Graph. Nodes represent types. Blue edges represent subClassOf relationships whereas grey edges represent unique type links. A unique type link is defined as a unique tuple: type of subject, predicate, (data)type of object. Arrow heads indicate the forward multiplicity of the unique type links: 0..1 and 1..1 multiplicities are indicated by diamonds; 0..N and 1..N multiplicities are indicated by circles.

GBOL key elements

Genome annotations include DNA sequences (e.g. chromosome, plasmid, contig), genes, transcripts, exons, introns, proteins, protein domains and genetic functional annotations for both prokaryotic and eukaryotic organisms. In the following we will indicate classes in italics/bold and properties by italics.

Genomic locations

Figure "Genomic locations": Graphical view of the GBOL ontology for genomic locations. An explanation of the classes is provided in the main text.

The genomic locations of all features in GBOL are captured with the Location, Position and StrandPosition classes, which are inspired by the FALDO ontology and represented in Figure "Genomic locations". The Location class and its subclasses, together with StrandPosition, define an interval on the sequence, whereas Position defines a single location in a sequence. A location can be either: i) a Region, which has begin and end positions; ii) a CollectionOfRegions (ordered or unordered); iii) a single Base at a given position; or iv) an InBetween location, denoting a location between two bases, directly after the base whose position is given. Each region, base and in-between location can be defined to be located on the forward, reverse or both strands, although no strand should be specified if the sequence is a single-stranded NA sequence or a protein sequence. All locations are by default located on the parent sequence, unless a reference sequence is given with the reference property. By default a location is associated with only one feature; if not, the reference feature property becomes obligatory. It should be noted that elements of a collection of regions can be located on different sequences. This can be used to encode cases in which a gene is located on two chromosomes, such as the Drosophila melanogaster mdg4 gene.
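
The location model described above can be sketched with plain Python data classes. This is only an illustration, not the GBOL API: the Strand enum, the field names, the sequences_covered helper and the sequence identifiers "seqA" and "seqB" are assumptions made for this example, and positions are simplified to integers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set, Tuple


class Strand(Enum):
    """Stand-in for StrandPosition (member names are assumptions)."""
    FORWARD = "forward"
    REVERSE = "reverse"
    BOTH = "both"


@dataclass(frozen=True)
class Region:
    begin: int                       # simplified to an integer index
    end: int
    strand: Optional[Strand] = None  # None: single-stranded NA or protein
    reference: Optional[str] = None  # None: located on the parent sequence


@dataclass(frozen=True)
class Base:
    at: int
    strand: Optional[Strand] = None
    reference: Optional[str] = None


@dataclass(frozen=True)
class InBetween:
    after: int                       # the location directly follows this base
    strand: Optional[Strand] = None
    reference: Optional[str] = None


@dataclass(frozen=True)
class CollectionOfRegions:
    regions: Tuple[Region, ...]
    ordered: bool = True


def sequences_covered(location, parent: str) -> Set[str]:
    """Return the identifiers of every sequence a location refers to."""
    if isinstance(location, CollectionOfRegions):
        covered: Set[str] = set()
        for region in location.regions:
            covered |= sequences_covered(region, parent)
        return covered
    return {location.reference or parent}


# Hypothetical gene whose two parts lie on different sequences.
split_gene = CollectionOfRegions((
    Region(100, 250, Strand.FORWARD),
    Region(40, 90, Strand.REVERSE, reference="seqB"),
))
```

Calling sequences_covered(split_gene, parent="seqA") collects both sequence identifiers, which mirrors how a collection of regions can span more than one sequence.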

Exactly known positions can be indicated using the ExactPosition class, which contains the position property. A position that is not exactly known, also called a FuzzyPosition, can be indicated using the BeforePosition class (containing the position property), the AfterPosition class (containing the position property), the InRangePosition class (containing the beginPosition and endPosition properties) or the OneOfPosition class (containing multiple position properties).
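
As a rough sketch of the position types listed above, the classes can be written as small Python data classes. The class and property names follow the text; the is_fuzzy and render helpers are hypothetical, and the text notation produced by render is an assumption loosely modelled on GenBank-style location strings, not something GBOL prescribes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ExactPosition:
    position: int


@dataclass(frozen=True)
class BeforePosition:
    position: int


@dataclass(frozen=True)
class AfterPosition:
    position: int


@dataclass(frozen=True)
class InRangePosition:
    beginPosition: int
    endPosition: int


@dataclass(frozen=True)
class OneOfPosition:
    positions: Tuple[int, ...]


def is_fuzzy(pos) -> bool:
    """Every position type except ExactPosition is a fuzzy position."""
    return not isinstance(pos, ExactPosition)


def render(pos) -> str:
    """Render a position as text (the notation is an assumption of this sketch)."""
    if isinstance(pos, ExactPosition):
        return str(pos.position)
    if isinstance(pos, BeforePosition):
        return f"<{pos.position}"
    if isinstance(pos, AfterPosition):
        return f">{pos.position}"
    if isinstance(pos, InRangePosition):
        return f"{pos.beginPosition}.{pos.endPosition}"
    if isinstance(pos, OneOfPosition):
        return "one-of(" + ",".join(str(p) for p in pos.positions) + ")"
    raise TypeError(f"unsupported position type: {type(pos).__name__}")
```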

Storage of the genomic location is inspired by FALDO. Differences include: i) StrandPosition is not subclassed from Position. Instead, an additional property is added to the Region, Base and InBetween locations, because these location types can have both a strand position and an index position on the sequence. ii) The reference property is not part of a Position but of a Location, because a location that starts on one sequence and ends on another sequence is undefined. iii) The BaseLocation and the InBetweenLocation classes have been added to the ontology. iv) The BaseLocation, InBetweenLocation, CollectionOfRegions and Region classes are children of the Location class, such that the rest of the ontology can incorporate these classes. v) The before and after positions have been explicitly defined to include their semantics. vi) The classes subclassed from FuzzyPosition have an integer to denote the position and do not point to another position object, which could allow for arbitrarily complex location denotations. vii) The N- and C-terminal positions have been removed and all indexes are counted from the N-terminal side. Counting from the C-terminal side can be calculated based on the sequence length. viii) The reflective properties beginOf and endOf have been removed, because a position can also be referenced by the added base location.
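
The conversion mentioned in point vii) is a one-line calculation. The helper below is a hypothetical sketch that assumes 1-based indexes, so that the last residue of a sequence has C-terminal index 1; the indexing convention is an assumption of this example.

```python
def index_from_c_terminus(n_index: int, length: int) -> int:
    """Convert an N-terminal index to the equivalent C-terminal index.

    Assumes 1-based indexes, so the last residue has C-terminal index 1.
    """
    if length < 1:
        raise ValueError("sequence length must be positive")
    if not 1 <= n_index <= length:
        raise ValueError("index lies outside the sequence")
    return length - n_index + 1
```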

Genes, transcripts and other features

Figure "Genes, transcript and other features": Graphical view of the GBOL ontology for genes, transcripts and other commonly encountered genomic features. An explanation of the classes is provided in the main text.

GBOL has a consistent model for storing genes, exons, (alternatively spliced) transcripts, coding sequences and proteins. Central to this model is the Sequence class that can have multiple annotations represented in the Feature class. An overview is provided in figure "Genes, transcript and other features".

In GBOL a sequence can be specified as a nucleic acid (NA) or a protein sequence. The sequence is attached to the Sequence class via the sequence property, provided in the DNA, RNA or protein encoding standard. NASequence can represent transcripts or other elements such as chromosomes, plasmids, scaffolds, contigs or reads. No distinction is made between DNA and RNA, and the strandType denotes whether it is a double- or single-stranded DNA or RNA. As indicated in figure "Genes, transcript and other features", the type of sequence determines the features it can be associated with (ProteinFeature, NAFeature or TranscriptFeature).

For each Sequence object the sequence should be given with the sequence property using either the default DNA, RNA or Protein encoding standard. Any special elements should be denoted with the modified base and modified residue features.

A protein can only have protein features, a transcript can only have transcript features and a nucleic acid sequence can only have nucleic acid features. Each feature has a genomic location on its parent sequence; for more details see section "Genomic locations". By default a feature can only have one parent sequence (no reflective property is included), except in rare cases such as the mdg4 gene of Drosophila, which is located on two chromosomes. Each feature and sequence can have cross references, citations and notes; see section "Element-wise provenance and qualifiers" for more detail.

Typically, each GBOLDocument contains one or more NASequences (e.g. Chromosome, Contig, mRNA), which can have multiple features, including all gene, exon, intron and sequence variation annotations, as well as structural, regulatory and repeat annotations. Each gene is linked to its associated exons, introns and transcripts. Due to alternative splicing a gene can have multiple transcripts. Each transcript has its own unique list of exons, which is linked to all associated exons through the exonList property and the associated ExonList class. A transcript can be either an mRNA, ncRNA, rRNA, tmRNA, tRNA, precursor RNA or a miscellaneous RNA. The type of transcript determines the associated features: mRNA transcripts can have features linked to the coding sequence (CDS), 5’-UTR, 3’-UTR and poly A tail. The mRNA translation table is defined with the translTable property within the parent sequence. The association between the CDS and the encoded protein is preserved, and information about the translation is stored if it differs from the default translation (for example, use of alternative stop codons).
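
The relation between a gene, its exons and its alternatively spliced transcripts can be illustrated with a small in-memory model. This is a sketch under stated assumptions rather than the GBOL data model itself: the Python classes, the add_transcript and spliced_length helpers, the gene name and the 1-based inclusive coordinates are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Exon:
    begin: int  # assumed 1-based, inclusive coordinates
    end: int


@dataclass
class Transcript:
    kind: str              # e.g. "mRNA", "tRNA", "ncRNA"
    exon_list: List[Exon]  # ordered list, standing in for ExonList


@dataclass
class Gene:
    name: str
    exons: List[Exon] = field(default_factory=list)
    transcripts: List[Transcript] = field(default_factory=list)

    def add_transcript(self, kind: str, exon_list: List[Exon]) -> Transcript:
        """Attach a transcript built only from exons linked to this gene."""
        unknown = [exon for exon in exon_list if exon not in self.exons]
        if unknown:
            raise ValueError(f"exons not linked to gene {self.name}: {unknown}")
        transcript = Transcript(kind, list(exon_list))
        self.transcripts.append(transcript)
        return transcript


def spliced_length(transcript: Transcript) -> int:
    """Total length of a transcript after joining its exons."""
    return sum(exon.end - exon.begin + 1 for exon in transcript.exon_list)


# Hypothetical gene with two alternatively spliced transcripts.
e1, e2, e3 = Exon(1, 100), Exon(201, 300), Exon(401, 450)
gene = Gene("geneX", exons=[e1, e2, e3])
full = gene.add_transcript("mRNA", [e1, e2, e3])
skipped = gene.add_transcript("mRNA", [e1, e3])
```

Both transcripts share exons owned by the gene, while each keeps its own ordered exon list, which is the role the text assigns to ExonList.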

Each protein has a unique IRI (http://gbol.life/0.1/protein/) based on the SHA-384 hash of its sequence. This makes it possible to combine protein information from heterogeneous sources, as a protein can be associated with several CDS features. All information related to the protein which is unique to the genome (such as location) should be stored in the CDS feature. Protein annotation features may include, among others, conserved regions, protein domains, binding sites, 3D structure, signal peptides, transmembrane regions and immunoglobulin regions.
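
A hash-based protein IRI of this kind could be derived roughly as follows. The base IRI and the use of SHA-384 come from the text; the normalisation step (trimming whitespace, upper-casing), the hexadecimal encoding of the digest and the function name protein_iri are assumptions of this sketch and may differ from what GBOL tooling actually does.

```python
import hashlib

PROTEIN_IRI_BASE = "http://gbol.life/0.1/protein/"


def protein_iri(sequence: str) -> str:
    """Build a protein IRI from the SHA-384 hash of the sequence."""
    # Assumption: normalise so that formatting differences do not change the IRI.
    normalized = sequence.strip().upper()
    # Assumption: the digest is appended in hexadecimal form.
    digest = hashlib.sha384(normalized.encode("ascii")).hexdigest()
    return PROTEIN_IRI_BASE + digest
```

Because the identifier depends only on the sequence, the same protein found in two genomes maps to the same IRI, while genome-specific details stay on the respective CDS features.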

For prokaryotic genomes each gene has only one transcript, which in turn has one exon that overlaps with the complete gene, although care should be taken in those cases in which bacterial genes have been reported to contain introns. Operons can be defined with the Operon feature, to which other genomic features, such as genes, can be associated. Additionally, viral genome integration can be denoted using the IntegratedVirus feature.

Provenance

Three types of provenance can be distinguished. Metadata refers to the biological origin of the samples. Dataset-wise and element-wise provenance pertain to the annotation process. An overview of the document structure is given in Figure "Document structure".

Document structure

Figure "Document structure": Graphical view of the GBOL Document structure. An explanation of the classes is provided in the main text.

All data within a single data collection stored in GBOL is based on the GBOLDataSet, which holds, among others, references to all included samples, sequences, organisms, annotation results and linked databases. An overview of the document structure is given in figure "Document structure". Furthermore, a version, a set of keywords, a document type, additional cross references, additional literature references and additional notes can be included. If the document has been published, it can be described with the PublishedGBOLDataSet, which holds version numbers, publication dates and an abstract.

A sequence originates from a sample, and the sample is related to one or multiple organisms. The sample property, which links to the Sample class, describes where, when, how, by whom and from what the sample was collected. The fields follow the GenBank format. The organism property describes the taxonomic reference, its scientific name and its taxonomic lineage. The source feature can be used to denote that a subpart of the sequence originates from a different sample or is associated with a different organism.

All annotations made within the GBOLDataSet have associated provenance and should originate from one of the listed annotation results, so that correspondence with originating databases is preserved. The Database and the GBOLDataSet classes are both subclassed from the Dataset class of the VoID ontology, which contains a general description including, among others, title, description, comment, license, version, data download address, SPARQL endpoint URI and URL encoding.

Dataset-wise provenance

Figure "Dataset-wise provenance": Graphical view of the GBOL Dataset-wise provenance. An explanation of the classes is provided in the main text.

Storage of the dataset-wise provenance is based on the PROV-O ontology, in which the Entity, Agent and Activity classes are central. An activity, which is executed by (wasAssociatedWith) an agent, can use and generate entities. As a result, an entity can be attributed to an agent.

The GBOLDataSet, AnnotationResult, GBOLLinkSet and Database classes (indicated in Figures "Dataset-wise provenance" and "Document structure") are subclasses of the PROV-O ontology Entity class, so that provenance on how, when and by whom each of these objects was created can be associated with it.

In GBOL an Entity is either a file or an annotation result. The annotation result is a set of triples contained within a GBOL document, whereas a file represents a physical file either on a computer or on a network. An agent can be a curator, person, organization or annotation software. For the annotation software, a version and a code repository with associated commit identifier are included to enable unambiguous identification. For a curator, an ORCID must be specified so that each curator can be uniquely identified together with their organization. Both Person and Organization are subclassed from the FOAF ontology to include additional information such as name and email.

Within GBOL, each activity is an annotation activity, which can be either an automatic process or a manual curation activity, with a start and end time. An automatic annotation must be associated with a software agent and the set of parameters used must be specified including the corresponding input and/or output files. Finally, manual curation must be associated with a curator.
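
The constraints on annotation activities described above can be expressed as a simple validation routine. The sketch below is illustrative only: the Python classes, field names, the check_activity helper and the example tool, repository URL and file names are hypothetical and are not taken from GBOL or PROV-O.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Union


@dataclass(frozen=True)
class AnnotationSoftware:
    name: str
    version: str
    code_repository: str
    commit_id: str


@dataclass(frozen=True)
class Curator:
    name: str
    orcid: str
    organization: str


@dataclass
class AnnotationActivity:
    agent: Union[AnnotationSoftware, Curator]  # wasAssociatedWith
    started_at: datetime
    ended_at: datetime
    parameters: Dict[str, str] = field(default_factory=dict)
    used: List[str] = field(default_factory=list)       # input files
    generated: List[str] = field(default_factory=list)  # output files


def check_activity(activity: AnnotationActivity) -> List[str]:
    """Return the rules from the text that the activity violates."""
    problems = []
    agent = activity.agent
    if isinstance(agent, Curator):
        if not agent.orcid:
            problems.append("manual curation requires a curator with an ORCID")
    elif isinstance(agent, AnnotationSoftware):
        if not (agent.version and agent.commit_id):
            problems.append("software needs a version and commit identifier")
        if not (activity.used or activity.generated):
            problems.append("automatic annotation needs input and/or output files")
    else:
        problems.append("activity must be associated with software or a curator")
    return problems


# Hypothetical automatic annotation run.
tool = AnnotationSoftware("toolX", "1.0", "https://example.org/toolX.git", "abc123")
run = AnnotationActivity(
    tool, datetime(2020, 1, 1, 10), datetime(2020, 1, 1, 11),
    parameters={"mode": "fast"}, used=["in.fasta"], generated=["out.ttl"],
)
```

An empty list returned by check_activity(run) means that none of the stated rules is violated.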

Element-wise provenance and qualifiers

Element-wise provenance

Figure "Element-wise provenance": Graphical view of the GBOL element-wise provenance. An explanation of the classes is provided in the main text.

In addition to the dataset-wise provenance, GBOL stores an additional layer of element-wise provenance: the provenance of all annotations in GBOL is captured per property per feature with the FeatureProvenance class, as shown in Figure "Element-wise provenance". For properties that could have items from multiple sources, we have defined the Qualifiers, each with its own associated provenance. A qualifier can be a citation, a note or a cross reference (indicated by xref). A citation can hold a reference to literature encoded with the BIBO ontology.

Annotations are linked to the Provenance object either through the provenance property of the qualifiers or through the onProperty property of the Provenance class. The onProperty property lists all the properties of the features to which the provenance applies. A special value ‘gbol:Existance’ is listed if the provenance is linked to the existence of the feature itself. The Provenance object links to both the dataset-wise provenance and the element-wise provenance.

The origin property links the provenance with the dataset-wise provenance (AnnotationResult), which includes, among others, the creation time, the identity of the creating agent and the parameters used, as previously mentioned.

The annotation property links the provenance with the element-wise provenance (ProvenanceAnnotation), which includes: i) a free text note to describe the annotation; ii) a list of references supporting the note; iii) an experimental code, preferably from the Evidence Ontology, to qualify the evidence supporting the conclusion; and iv) an optional derivedFrom property that links to other features on which the annotation is based.
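
The way a Provenance object ties a feature's properties to both provenance layers can be sketched as follows. The value 'gbol:Existance' is quoted from the text; the Python classes, field names, the provenance_for helper and all example identifiers are hypothetical and serve only to illustrate the lookup by property.

```python
from dataclasses import dataclass, field
from typing import List, Optional

EXISTENCE = "gbol:Existance"  # special value quoted in the text


@dataclass
class ProvenanceAnnotation:
    note: Optional[str] = None
    references: List[str] = field(default_factory=list)
    experimental_code: Optional[str] = None  # ideally an Evidence Ontology term
    derived_from: List[str] = field(default_factory=list)


@dataclass
class Provenance:
    on_property: List[str]            # feature properties this applies to
    origin: str                       # identifier of the AnnotationResult
    annotation: ProvenanceAnnotation  # element-wise details


def provenance_for(records: List[Provenance], prop: str) -> List[Provenance]:
    """Select the provenance records that apply to one feature property."""
    return [record for record in records if prop in record.on_property]


# Hypothetical provenance records attached to a single feature.
records = [
    Provenance([EXISTENCE], "result/geneCaller",
               ProvenanceAnnotation(note="predicted gene")),
    Provenance(["product", "function"], "result/domainScan",
               ProvenanceAnnotation(note="domain hit",
                                    derived_from=["feature/domain1"])),
]
```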

Finally, each annotation tool generates its own evidence statements, often embedded in a statistical framework characteristic of the algorithmic approach, such as E- or p-values, bit scores, matching regions or any other scoring system. To store this information a subclass of the ProvenanceAnnotation class can be created. Some example classes include Blast, HMM and SignalP, associated with the output of the corresponding tools. However, these classes are not part of the GBOL ontology itself.
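
A tool-specific subclass of this kind might look like the sketch below. Only the class names ProvenanceAnnotation and Blast are taken from the text; the fields evalue and bitscore, the is_significant helper and its default threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProvenanceAnnotation:
    note: str


@dataclass
class Blast(ProvenanceAnnotation):
    # Hypothetical fields; the text names E-values and bit scores as examples.
    evalue: float
    bitscore: float


def is_significant(hit: Blast, max_evalue: float = 1e-5) -> bool:
    """Apply an arbitrary illustrative E-value cut-off to a hit."""
    return hit.evalue <= max_evalue


# Hypothetical evidence statement produced by a similarity search.
hit = Blast(note="similar to a known kinase", evalue=1e-30, bitscore=250.0)
```

Keeping the tool-specific scores in a subclass leaves the generic note, references and evidence code untouched, so consumers that only understand ProvenanceAnnotation can still read the record.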