Reference sequence sources

Category: REFERENCE SEQUENCE SOURCES
Sources compared: Genome Reference Consortium (GRC), GENCODE, RefSeq, Locus Reference Genomic (LRG)
PURPOSE
GRC: Maintaining and updating reference genome sequences, including for human.
GENCODE: Enhancing and extending the annotation of all evidence-based gene features in the human genome at high accuracy.
RefSeq: Providing a comprehensive, integrated, non-redundant, well-annotated set of sequences (genomic, transcript and protein).
LRG: Creating stable reference sequence records that are used for reporting sequence variants with clinical implications.
PRODUCT
GRC: The reference genome assembly for human consists of the primary assembly and alternate loci (patches). Patches either represent error corrections (FIX patch) or alternate alleles (NOVEL patch).
GENCODE: GENCODE genes, transcripts and proteins. The GENCODE gene set is made by merging manual annotation created by the Ensembl-HAVANA team and automated evidence-based annotation from the Ensembl Genebuild.
RefSeq: RefSeqGene (a region of genomic DNA encompassing and flanking the transcribed region of a gene), RefSeq transcripts, RefSeq proteins.
LRG: LRG records contain stable, and thus un-versioned, reference sequences designed specifically for reporting sequence variants with clinical implications.
SOURCE
GRC: A collaboration between the Sanger Institute, EMBL-EBI, the NCBI and the McDonnell Genome Institute.
GENCODE: A collaboration led by Ensembl at EMBL-EBI, based near Cambridge, UK, and including partners from Yale University, UCSC, MIT, CRG, CNIO and the University of Lausanne.
RefSeq: The NCBI, based in Bethesda, MD, USA.
LRG: A collaboration between EMBL-EBI and the NCBI RefSeq team.
ACCESSION IDENTIFIER
GRC:
Assembly identifier format: GRChxxx; e.g. GRCh38
Assembly release with patches: GRChxxx.pX; e.g. GRCh38.p7
GENCODE:
Gene identifier format: ENSGxxx; e.g. ENSG00000012048
Transcript identifier format: ENSTxxx; e.g. ENST00000357654.7
Protein identifier format: ENSPxxx; e.g. ENSP00000350283.3
RefSeq:
Gene identifier format: NG_xxx; e.g. NG_005905.2
Transcript identifier format: NM_xxx; e.g. NM_007294.3
Protein identifier format: NP_xxx; e.g. NP_009225.1
LRG:
Gene identifier format: LRG_xxx; e.g. LRG_292
Transcript identifier format: LRG_xxxtxxx; e.g. LRG_292t1
Protein identifier format: LRG_xxxpxxx; e.g. LRG_292p1
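The gene, transcript and protein identifier formats listed above can be told apart by their prefixes. The following is a minimal Python sketch of that idea. The regular expressions are inferred only from the example identifiers in this table and are an assumption, not an official specification; RefSeq and Ensembl both use further prefixes that are not covered here. The names `Parsed` and `classify` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Parsed(NamedTuple):
    source: str             # "GENCODE", "RefSeq" or "LRG"
    kind: str               # "gene", "transcript" or "protein"
    version: Optional[int]  # None when the identifier carries no version


# Patterns inferred from the examples in the table (assumption, not a spec).
_PATTERNS = [
    ("GENCODE", "gene", r"ENSG\d+"),
    ("GENCODE", "transcript", r"ENST\d+"),
    ("GENCODE", "protein", r"ENSP\d+"),
    ("RefSeq", "gene", r"NG_\d+"),
    ("RefSeq", "transcript", r"NM_\d+"),
    ("RefSeq", "protein", r"NP_\d+"),
]
# LRG_292 (gene), LRG_292t1 (transcript), LRG_292p1 (protein)
_LRG = re.compile(r"LRG_\d+(?:(t|p)\d+)?")


def classify(identifier: str) -> Optional[Parsed]:
    """Return the source, feature kind and version of an identifier, or None."""
    m = _LRG.fullmatch(identifier)
    if m:
        kind = {"t": "transcript", "p": "protein", None: "gene"}[m.group(1)]
        return Parsed("LRG", kind, None)  # LRG identifiers are un-versioned
    for source, kind, stem in _PATTERNS:
        # The version, when present, follows a full stop: NM_007294.3
        m = re.fullmatch(stem + r"(?:\.(\d+))?", identifier)
        if m:
            version = int(m.group(1)) if m.group(1) else None
            return Parsed(source, kind, version)
    return None


if __name__ == "__main__":
    for example in ["ENSG00000012048", "ENST00000357654.7",
                    "NM_007294.3", "LRG_292t1"]:
        print(example, classify(example))
```

Returning `None` for unrecognised input, rather than raising, keeps the sketch usable as a filter over mixed identifier lists.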
STABILITY
GRC:
Versioned, very infrequently.
Major updates (sequence and structure changes, which may disrupt chromosome coordinates) are indicated by the number after GRCh (e.g. GRCh37 to GRCh38).
Minor updates (the addition of patches) can occur quarterly, and are indicated by the number after 'p'; e.g. GRCh38.p10. Patch updates do not disrupt the primary assembly chromosome coordinates.
The GRC has not announced plans to release a GRCh39 assembly.
GENCODE:
Versioned.
Updates are denoted by the final number in the identifier (after the full stop/period); e.g. ENSTxxx.1
Updates are issued in batches (e.g. GENCODE release 26) as part of an Ensembl release (e.g. Ensembl release 88). This is normally every 2-3 months.
RefSeq:
Versioned.
Updates are denoted by an increment to the numeric version after the decimal point; e.g. NM_xxx.2
Individual sequence updates are available on an ad hoc basis and batch released at a later date (e.g. RefSeq release 81).
LRG:
Not versioned.
Once an LRG has been made public, its “fixed” section, which contains reference sequences and exon numbering, will never change.
The “updatable” section, which contains mappings, annotations and community information, is updated weekly.
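The versioning rules above can be made concrete with a short Python sketch. It assumes only what the STABILITY row states: the number after GRCh marks a major update, the number after 'p' marks a patch release that leaves primary assembly coordinates untouched, and GENCODE and RefSeq versions follow a full stop. The function names are hypothetical, and the parser accepts patch names written with or without the full stop before 'p'.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Assembly(NamedTuple):
    major: int  # number after "GRCh"; a change may shift chromosome coordinates
    patch: int  # number after "p"; 0 when no patch release is given


def parse_assembly(name: str) -> Assembly:
    """Parse names such as 'GRCh38', 'GRCh38.p10' or 'GRCh38p7'."""
    m = re.fullmatch(r"GRCh(\d+)(?:\.?p(\d+))?", name)
    if m is None:
        raise ValueError(f"not a recognised assembly name: {name!r}")
    return Assembly(int(m.group(1)), int(m.group(2) or 0))


def same_primary_coordinates(a: str, b: str) -> bool:
    """True when two releases differ only by patches, so primary assembly
    chromosome coordinates are expected to be unchanged between them."""
    return parse_assembly(a).major == parse_assembly(b).major


def split_version(accession: str) -> Tuple[str, Optional[int]]:
    """Split 'NM_007294.3' into ('NM_007294', 3); un-versioned IDs give None."""
    stem, dot, version = accession.partition(".")
    return stem, (int(version) if dot else None)


if __name__ == "__main__":
    print(parse_assembly("GRCh38.p10"))
    print(same_primary_coordinates("GRCh37", "GRCh38"))
    print(split_version("ENST00000357654.7"))
    print(split_version("LRG_292t1"))
```

Comparing only the major number reflects the statement that patch updates do not disrupt primary assembly coordinates; it says nothing about coordinates on the patch scaffolds themselves.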
SEQUENCE
GRC: The primary reference assembly is a composite of sequence from 13 individuals and therefore does not necessarily represent the major alleles of any given population.
GENCODE: GENCODE sequences always match the genome reference assembly.
RefSeq: RefSeq sequences don’t necessarily match the genome reference assembly. RefSeq sequences, to the extent possible, represent a prevalent, 'standard' allele. The default implementation of 'standard allele' is the sequence from the GRCh38 primary assembly. If, however, there is evidence that the GRCh38 sequence is not standard, the sequences can be constructed from an alternate source sequence, or locally modified.
LRG: As LRGs are based on RefSeq sequences, they do not necessarily match the primary reference assembly. However, it is our default policy to match the primary reference assembly, unless there is a convincing argument that an alternate allele is more appropriate.
ANNOTATION
The process of finding and designating locations of individual genes and other features on raw DNA sequences.
GRC:
Genome assemblies (GRCh37 or GRCh38) are un-annotated genome builds. Annotation is provided by Ensembl (GENCODE) and the NCBI (RefSeq).
GENCODE:
Both manual and automated gene annotation approaches utilise primary transcriptomic and proteomic data aligned to the reference genome to determine transcript structure and CDSs.
Manual annotation also incorporates datasets that capture TSS and transcript 3’ ends, epigenetic and transcription factor binding data, as well as cross-species conservation at both the sequence and transcript level, to refine structural and functional annotation.
The resulting manual annotation is merged with the automated annotation via a hierarchical method that gives precedence to manual annotation to produce the GENCODE gene set.
RefSeq:
RefSeq curators at the NCBI review a variety of transcript, proteomic, epigenomic, and variation data to annotate a set of well-supported and biologically valid transcripts and their encoded proteins for each gene.
The curated transcripts and proteins are aligned to the genome and combined with additional computational models to generate a comprehensive genome annotation.
Genomic RefSeqGene records are then defined for a subset of genes, and annotated by aligning the curated transcripts to the genomic sequence, as well as by projecting features for neighboring genes from the genome annotation.
LRG:
LRG curators review both RefSeq and GENCODE locus annotation as well as the supporting evidence. They work with RefSeq and GENCODE annotators to update locus annotation with a view to creating matching transcript models for inclusion in the LRG record.
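The GENCODE merge step described in this row, where manual annotation takes precedence over automated annotation, can be illustrated with a toy Python sketch. This is not the Ensembl-HAVANA merge pipeline, whose rules are not described here. It assumes, purely for illustration, that two transcript models count as "the same" when their exon coordinates are identical; the `Transcript` type, the `merge` function and all coordinates are hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple

Exons = Tuple[Tuple[int, int], ...]  # (start, end) pairs on one chromosome


class Transcript(NamedTuple):
    name: str    # hypothetical label, for illustration only
    exons: Exons
    origin: str  # "manual" or "automated"


def merge(manual: List[Transcript],
          automated: List[Transcript]) -> List[Transcript]:
    """Merge two annotation sets, keeping the manual model whenever both
    sets contain a transcript with the same exon structure."""
    merged: Dict[Exons, Transcript] = {}
    # Lowest precedence first, so manual entries overwrite automated ones.
    for transcript in automated + manual:
        merged[transcript.exons] = transcript
    return sorted(merged.values(), key=lambda t: t.exons)


if __name__ == "__main__":
    manual = [Transcript("m1", ((100, 200), (300, 400)), "manual")]
    automated = [
        Transcript("a1", ((100, 200), (300, 400)), "automated"),  # same as m1
        Transcript("a2", ((100, 200), (350, 400)), "automated"),  # novel
    ]
    for transcript in merge(manual, automated):
        print(transcript.name, transcript.origin)
```

Automated models with a structure not seen in the manual set are retained, so the merged set extends the manual annotation rather than replacing the automated one wholesale.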
INTERACTION WITH LRG
GRC:
LRG curators may request a review by the GRC if the primary reference assembly contains a non-standard allele, and may request the creation of patches if required by the community.
GENCODE:
At the request of LRG curators, GENCODE annotators review the annotation of clinically relevant transcripts.
GENCODE sequences are included in LRGs. Any mismatches are clearly shown.
RefSeq:
The LRG is a collaboration with RefSeq. LRG records contain RefSeq sequences.
At the request of LRG curators, NCBI curators review the annotation of clinically relevant transcripts.
LRG curators can ask NCBI curators to change RefSeq alleles to the ‘standard allele’ as determined by population data or expert community input.
LRG:
Communication with GENCODE and RefSeq is aimed at ensuring that the highest quality, ideally matching, model exists for inclusion in the LRG.
LRG curators can feed back input from the clinical community to RefSeq, GENCODE and the GRC.
REFERENCES
Other sources of annotation and relevant projects
APPRIS APPRIS is a system to annotate alternatively spliced transcripts based on a range of computational methods. LRG curators use this information to help identify well-supported transcript models (Rodriguez JM et al., 2013).
CCDS The CCDS project is a collaborative effort to identify a core set of protein-coding regions that are consistently annotated and of high quality. Manual and automated genome annotations provided by the NCBI and Ensembl (which incorporates manual HAVANA annotations) are compared to identify annotations with matching genomic coordinates.
EMBL-EBI The European Bioinformatics Institute (EMBL-EBI) shares data from life science experiments, performs basic research in computational biology and offers an extensive user training programme, supporting researchers in academia and industry. It is part of EMBL, Europe’s flagship laboratory for the life sciences.
Ensembl Ensembl, based at EMBL-EBI, is a project to develop a software system which produces and maintains automatic annotation on selected eukaryotic genomes (Aken BL et al., 2017).
HAVANA The HAVANA team produces high-quality gene models by the manual annotation of vertebrate genomes. Previously based at the Sanger Institute, the team has recently moved to EMBL-EBI.
HGMD The Human Gene Mutation Database (HGMD®) represents an attempt to collate known (published) gene lesions responsible for human inherited disease. LRG curators use this information to help establish which transcript models are used by the community.
HGNC HGNC (HUGO Gene Nomenclature Committee) is responsible for approving unique symbols and names for human loci, including protein-coding genes, ncRNA genes and pseudogenes, to allow unambiguous scientific communication.
HGVS The Human Genome Variation Society issues guidelines and recommendations on the nomenclature of gene variations (den Dunnen et al., 2016).
LOVD The Leiden Open Variation Database (LOVD) provides a tool for gene-centered collection and display of DNA variations and aggregates variants from locus-specific databases (LSDBs). LRG curators refer to LOVD LSDBs to help establish which transcript models are in use by the community.
NCBI The USA’s National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information.
TGMI The Transforming Genetic Medicine Initiative (TGMI) is building the knowledge base, tools and processes needed to deliver genetic medicine.
UCSC UCSC provides UCSC-specific annotations on GRCh37; these are based on RefSeq annotation. As of July 29, 2015, GENCODE annotations are the default annotations on GRCh38 in the UCSC genome browser. UCSC annotations are no longer being generated. Together with Ensembl, UCSC supplies Transcript Support Levels (TSL) for all GENCODE transcripts. LRG curators use this information to identify well-supported transcript models.
UniProt The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. UniProt is a collaboration between the European Bioinformatics Institute (EMBL-EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR).
VEP VEP (Variant Effect Predictor) determines the effect of variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence. It is produced by Ensembl at EMBL-EBI (McLaren W et al., 2016).