File format for genomic features
General Feature Format
- Filename extensions: .gff, .gff3
- Internet media type: text/gff3
- Developed by: Sanger Centre (v2), Sequence Ontology Project (v3)
- Type of format: Bioinformatics
- Extended from: Tab-separated values
- Open format? Yes
- Website: github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
In the sprawling, often tedious, landscape of bioinformatics, the general feature format—variously known as gene-finding format, generic feature format, or simply GFF—emerges as a foundational file format. Its primary, and rather unglamorous, purpose is to meticulously describe genes and a myriad of other identifiable characteristics found within the intricate stretches of DNA, RNA, and protein sequences. Think of it as the geological survey map for the microscopic world, detailing every interesting rock, fault line, and potential gold deposit. It's a necessary evil, ensuring that the vast, complex data generated by sequencing projects can actually be understood and shared among researchers, preventing utter chaos. Without such a standardized format, the sheer volume of biological information would remain an inscrutable mess, rendering much of modern genomic research… well, pointless.
GFF Versions
Like all good intentions, the initial iterations of GFF proved to be… adequate, until they weren't. The evolutionary path of GFF is a testament to the human condition: build something, find its flaws, then build something slightly less flawed.
- General Feature Format Version 2: This version, while groundbreaking at the time, is now generally considered deprecated. It served its purpose, much like a first draft, but its limitations became glaringly apparent as biological data grew in complexity and our understanding deepened. It was a good start, if you're into that sort of thing.
- Gene Transfer Format 2.2: A direct descendant, or perhaps a slightly mutated offshoot, of GFF2, this format is notably used by Ensembl. It addressed some immediate needs but inherited many of its progenitor's structural quirks, leading to its own set of constraints.
- Generic Feature Format Version 3 (GFF3): This is the current standard, the prodigal son that sought to rectify the perceived sins of its predecessors. GFF3 was developed with an eye towards greater flexibility and semantic clarity, aiming to encapsulate the increasingly nuanced understanding of genomic features. It's still a GFF, but it tries harder.
- Genome Variation Format (GVF): This is a specialized variant that builds upon the GFF3 framework, introducing additional pragmas and attributes specifically designed for describing sequence_alteration features. Because, apparently, even GFF3 couldn't quite capture all the nuances of genomic variability.
The fundamental issue with GFF2 and its close relative, GTF, was a rather glaring deficiency: they could only represent feature hierarchies up to two levels deep. This immediately presented a problem for something as inherently layered as a gene, which typically involves a three-level structure: a gene containing one or more transcripts, which in turn comprise multiple exons. It was like trying to describe a skyscraper with only two floors. GFF3, in its infinite wisdom (or rather, its developers' belated realization), explicitly tackled this limitation. It supports an arbitrarily deep hierarchy of features, allowing for a more accurate and comprehensive representation of complex genomic architectures. Furthermore, GFF3 introduced specific meanings for certain tags within its attributes field, moving beyond mere descriptive text to a more structured and machine-readable semantic framework. This makes it significantly more robust for automated parsing and analysis, reducing the ambiguity that plagued earlier versions.
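The three-level gene, transcript, exon structure described above can be illustrated with a small sketch. It assumes the features have already been parsed and that each child names its parent through the `ID` and `Parent` attribute tags defined by the GFF3 specification. The feature names, the `FEATURES` list, and both helper functions are hypothetical, and the sketch assumes at most one parent per feature, which is a simplification.

```python
from collections import defaultdict

# Hypothetical, already-parsed features: (ID, type, Parent or None).
FEATURES = [
    ("gene1", "gene", None),
    ("mRNA1", "mRNA", "gene1"),
    ("exon1", "exon", "mRNA1"),
    ("exon2", "exon", "mRNA1"),
]


def build_children(features):
    """Map each parent ID to the list of its child IDs, in file order."""
    children = defaultdict(list)
    for feature_id, _type, parent in features:
        if parent is not None:
            children[parent].append(feature_id)
    return children


def depth(feature_id, parents):
    """Count the levels from a feature up to its top-level ancestor."""
    level = 1
    while parents.get(feature_id) is not None:
        feature_id = parents[feature_id]
        level += 1
    return level


children = build_children(FEATURES)
parents = {fid: parent for fid, _type, parent in FEATURES}
print(depth("exon2", parents))  # exon -> mRNA -> gene
```

Nothing in the loop limits how far the parent chain can go, which is the point: the same bookkeeping handles two levels or ten.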
It's worth noting, for those who appreciate historical footnotes, that GTF is functionally identical to GFF version 2. This close relationship is a frequent source of minor confusion, a small testament to naming conventions in bioinformatics that sometimes feel designed to test one's patience.
GFF general structure
All GFF formats (GFF2, GFF3, and GTF) adhere to a rather rigid, tab-delimited structure. Each feature line consists of exactly nine fields. The formats share the same structure for the first seven fields; their true divergence lies in the content and formatting of the ninth field. It's a classic case of "mostly the same, except for the parts that aren't."
In a commendable, albeit somewhat belated, effort to mitigate confusion, GFF3 saw some field names altered. For instance, the first field, now known as "seqid," was previously referred to as "sequence." This change was made to prevent it from being conflated with the actual nucleotide or amino acid chain itself, a distinction that, to some, might seem obvious, but apparently warranted explicit clarification.
The general structure, the bedrock upon which all GFF files are built, is as follows:
General GFF3 structure
| Position index | Position name | Description |
|---|---|---|
| 1 | seqid | The rather specific name of the sequence where the feature is located. This is not the sequence itself, merely its identifier. |
| 2 | source | The algorithm or procedure that had the dubious honor of generating this particular feature. This field typically identifies the specific software or the authoritative database that performed the annotation. It tells you who or what claimed this feature exists. |
| 3 | type | The descriptive name of the feature's type, such as "gene," "exon," or "CDS." In a well-structured GFF file—and one can only hope they are—child features are expected to follow their parent features in a contiguous block. For example, all exons belonging to a specific transcript should appear immediately after their parent "transcript" feature line and before any other parent transcript. In GFF3, all features and their relationships are meant to be strictly compatible with the standards meticulously laid out by the Sequence Ontology Project, ensuring a degree of semantic consistency that was historically… aspirational. |
| 4 | start | The genomic start coordinate of the feature. Crucially, this is defined with a 1-base offset. This is a subtle but significant detail, contrasting sharply with other common sequence formats, such as BED, which typically employ a 0-offset, half-open coordinate system. A frequent source of off-by-one errors for the unwary. |
| 5 | end | The genomic end coordinate of the feature. Like the start coordinate, it is 1-based, and it is inclusive: the base at this position belongs to the feature. Because the end of a 0-offset, half-open format such as BED is exclusive, the GFF end value is numerically identical to the BED end for the same feature; only the start differs by one. |
| 6 | score | A numerical value, often floating-point, that generally serves as an indicator of the source's confidence in the annotated feature. A higher score typically implies greater certainty. If the value is simply "." (a dot), it signifies a null or undefined value: the source did not provide a score, which is not the same as reporting zero confidence. |
| 7 | strand | A single character indicating the strand of the feature. This can be "+" for the positive strand (or 5'->3'), "-" for the negative strand (or 3'->5'), "." if the strand is undetermined or irrelevant, or "?" for features where the strand is relevant but remains unknown. Because sometimes, even the universe is ambiguous. |
| 8 | phase | This field is specifically relevant for CDS (coding sequence) features. It can take one of three integer values: 0, 1, or 2. For any other feature type, this field is denoted by "." (a dot). A more detailed explanation follows, for those who enjoy parsing the finer points of genetic translation. |
| 9 | attributes | A flexible, semi-structured list of tag-value pairs, each separated by a semicolon. This field serves as a catch-all for any additional, supplementary information deemed relevant to the feature. It's where all the extra metadata gets dumped, often in a format that requires careful parsing. |
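The nine columns above map naturally onto a small record type. The following is a minimal sketch rather than a complete parser: it assumes GFF3-style `tag=value` attributes with percent-encoded reserved characters (as the GFF3 specification describes), does not split multi-valued attributes, and uses hypothetical names (`Feature`, `parse_line`, `to_bed_interval`) and a made-up example line.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int               # 1-based, inclusive
    end: int                 # 1-based, inclusive
    score: Optional[float]   # None when the column is "."
    strand: str
    phase: Optional[int]     # None when the column is "."
    attributes: dict


def parse_attributes(column: str) -> dict:
    """Split 'tag=value;tag=value' into a dict (GFF3 style only)."""
    attributes = {}
    if column == ".":
        return attributes
    for pair in column.strip().split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        attributes[unquote(tag)] = unquote(value)
    return attributes


def parse_line(line: str) -> Feature:
    """Parse one tab-delimited feature line into a Feature."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    return Feature(
        seqid, source, ftype, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if phase == "." else int(phase),
        parse_attributes(attrs),
    )


def to_bed_interval(feature: Feature) -> tuple:
    """Convert to a 0-based, half-open (start, end) pair as used by BED."""
    return feature.start - 1, feature.end


# Made-up example line, for illustration only.
example = "chr1\texample_source\texon\t1300\t1500\t.\t+\t.\tID=exon1;Parent=mRNA1"
feature = parse_line(example)
print(feature.start, feature.end, to_bed_interval(feature))
```

Note that `to_bed_interval` changes only the start: the 1-based inclusive end and the 0-based exclusive end are the same number, which is exactly where off-by-one errors tend to creep in.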
The 8th field: phase of CDS features
To be blunt, CDS stands for "coding sequence." The precise definition of this term, along with countless others, is meticulously curated by the Sequence Ontology (SO). As per the GFF3 specification, which one should ideally consult before making assumptions:
For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon.
In simpler terms, because not everyone lives and breathes bioinformatics specifications, the phase tells you how many bases at the start of the current CDS segment need to be skipped to ensure the subsequent bases align correctly with a codon boundary. It's crucial for understanding how the sequence will be translated into protein and ensures that the reading frame is maintained across discontinuous CDS segments (like those interrupted by introns). A phase of 0 means the feature begins with a complete codon; 1 means one base needs to be removed; 2 means two bases need to be removed. It's a small detail, but one that can utterly derail protein prediction if ignored.
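The phase rule can be made concrete with a short sketch. It assumes each CDS segment is supplied as a string already oriented 5'->3' on the feature's own strand, so that "the beginning" is the left end of the string. The function names and the example sequence are hypothetical. The second function derives the phase of the following segment from the length and phase of the current one, which is how the reading frame carries across an intron.

```python
def codons(segment: str, phase: int) -> list:
    """Drop `phase` leading bases, then return the complete codons."""
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1, or 2")
    in_frame = segment[phase:]
    usable = len(in_frame) - len(in_frame) % 3
    return [in_frame[i:i + 3] for i in range(0, usable, 3)]


def next_phase(length: int, phase: int) -> int:
    """Phase of the next CDS segment, given this segment's length and phase.

    After skipping `phase` bases, (length - phase) % 3 bases are left over
    at the end; the next segment must supply the rest of that codon.
    """
    return (3 - (length - phase) % 3) % 3


# Made-up 10-base segment with phase 1: skip "G", then read whole codons.
segment = "GATGGCCAAA"
print(codons(segment, 1))
print(next_phase(len(segment), 1))
```

With phase 0 the same 10-base segment would leave one base dangling, so the next segment would need phase 2 to complete that codon.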
Meta Directives
Beyond the structured data lines, GFF files also permit the inclusion of supplementary meta-information. This additional context is introduced by lines beginning with a double hash symbol (##). These "meta directives" can specify, for instance, the GFF version being used, delineate specific sequence regions, or identify the species to which the genomic features belong. A comprehensive list of all permissible directives can be found within the Sequence Ontology specifications, for those who enjoy reading technical documentation. It's the kind of information that's not part of the main story, but absolutely essential for understanding it.
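As a rough illustration of how directive lines can be separated from feature lines, the sketch below collects every `##` line as a name plus its arguments. The directive names `gff-version` and `sequence-region` come from the GFF3 specification; the sample content and the `parse_directives` helper are hypothetical, and no attempt is made to validate the arguments.

```python
def parse_directives(lines):
    """Collect '##' directive lines as (name, [arguments]) pairs."""
    directives = []
    for line in lines:
        if not line.startswith("##"):
            continue  # feature lines and ordinary '#' comments
        if line.strip() == "###":
            # A bare '###' signals that earlier forward references are
            # resolved (per the GFF3 specification); it has no arguments.
            continue
        parts = line[2:].split()
        if parts:
            directives.append((parts[0], parts[1:]))
    return directives


# Made-up sample file content, for illustration only.
SAMPLE = [
    "##gff-version 3",
    "##sequence-region ctg123 1 1497228",
    "# an ordinary comment, ignored here",
    "ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001",
    "###",
]
directives = parse_directives(SAMPLE)
print(directives)
```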
GFF software
The utility of any file format is, of course, entirely dependent on the software designed to interact with it. GFF, being a cornerstone in genomics, has naturally spawned a variety of tools.
Servers
These are the systems that typically generate or provide GFF data, serving it up for consumption by researchers and client applications.
| Server | Example file |
|---|---|
| UniProt | [1] |
UniProt, a comprehensive, high-quality resource for protein sequences and functional information, is one such server that provides data in formats compatible with GFF standards, allowing users to integrate protein-centric annotations with genomic context.
Clients
These are the applications that consume, visualize, and analyze GFF data, transforming raw coordinates and types into something intelligible.
| Name | Description | Links |
|---|---|---|

GFF is a particularly delightful example of how biologists, in their infinite wisdom, have opted to create a relatively simple, yet consistently frustrating, way to describe the features of DNA, RNA, and protein sequences. If you thought tracking your own life events was complicated, try mapping every tiny detail of an organism's genome.
Validation
Given the intricate nature of these formats and the sheer volume of data they handle, errors are not just possible, but practically inevitable. This is where validation tools become less of a suggestion and more of a necessity. They exist because, apparently, humans cannot be trusted to meticulously adhere to a 9-field, tab-delimited standard without supervision.
The modENCODE project, a rather ambitious endeavor to identify functional elements in the genomes of fruit flies and worms, offers an online GFF3 validation tool. It's surprisingly robust, boasting generous limits of 286.10 MB and an impressive 15 million lines. Because even large-scale genomic projects generate their fair share of questionable data.
For those who prefer a more hands-on approach, or perhaps distrust the internet with their precious, error-riddled files, the GenomeTools software collection includes a gff3validator tool. This can be used offline to validate and, if you're lucky, even tidy up errant GFF3 files. An online validation service is also available, for those who appreciate options, or simply haven't learned their lesson about relying on web services.
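To give a feel for what such validators look for, here is a deliberately small sketch of per-line structural checks. It is not how the modENCODE or GenomeTools validators work internally and is no substitute for them; `check_line` is a hypothetical helper that only tests the column count, the coordinates, the strand character, and the CDS phase described earlier.

```python
VALID_STRANDS = {"+", "-", ".", "?"}


def check_line(line: str) -> list:
    """Return a list of problems found in one feature line (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return [f"expected 9 fields, found {len(fields)}"]
    problems = []
    _seqid, _source, ftype, start, end, _score, strand, phase, _attrs = fields
    try:
        start_i, end_i = int(start), int(end)
    except ValueError:
        problems.append("start and end must be integers")
    else:
        if start_i < 1 or start_i > end_i:
            problems.append("start must be >= 1 and <= end")
    if strand not in VALID_STRANDS:
        problems.append(f"invalid strand {strand!r}")
    if ftype == "CDS" and phase not in {"0", "1", "2"}:
        problems.append("CDS features need a phase of 0, 1, or 2")
    return problems


# Made-up lines: one well-formed, one with three separate problems.
good = "chr1\tsrc\tCDS\t10\t20\t.\t+\t0\tID=cds1"
bad = "chr1\tsrc\tCDS\t30\t20\t.\t*\t.\tID=cds1"
print(check_line(good))
print(check_line(bad))
```

Real validators go much further, checking for example that every `Parent` refers to an existing `ID` and that feature types are valid Sequence Ontology terms.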
See also
For those whose curiosity extends beyond the mere structure of GFF, a few related topics might prove… illuminating: