Assembly
This build adds the HuRef assembly and includes with no changes the reference genome assembly, the alternate Celera assembly, and partial chromosome alternate haplotypes.
Annotation
This build represents an annotation update. New and updated RefSeq and GenBank transcripts and proteins were aligned to the genome assemblies and new annotation models were calculated. The annotation methods were modified as follows:
Annotation of non-coding RNAs: The annotation method was modified to annotate non-coding RNA explicitly as ncRNA.
Improved tracking of annotation across assemblies: Assembly-to-assembly genome alignments are now used during the annotation process. Novel annotations determined to be from comparable locations on more than one assembly were assigned the same GeneID, symbol, and full name.
Map Viewer
Asynchronous updates for some maps:
Variation: an update for the Variation track will be released at a later date.
CCDS: analysis is underway to calculate an update for the human CCDS dataset.
Display of Celera's annotation of genes and transcripts
Celera Genes (craGenes) is a new option to display Celera's annotation of genes on the Celera assembly.
Celera transcripts (craRNA) is a new option to display Celera's annotation of transcripts on the Celera assembly.
Reports
ASN files: Chromosome sequence data reported in ASN.1 format include alignments of RefSeq records that do not match the genome sufficiently to support mapping annotation from the RefSeq to the genome (e.g., indels, unaligned 5' end, gene segments such as immunoglobulins, and some gene clusters). These alignments are visible in the new NCBI graphical sequence viewer and in the graphical user interface (GUI) tool Genome Workbench.
Assembly
The genome assembly did not change with this update; it is identical to that provided for build 36.1.
Annotation
This build represents an annotation update. New and updated RefSeq and GenBank transcripts and proteins were aligned to the genome and new annotation models were calculated. The annotation methods were modified for this build, resulting in a significant reduction in the number of predicted splice variants (XM_ accessions).
Masking: The program RepeatMasker was used to mask repetitive genomic sequences before making BLAST hits.
Transcript and Protein Alignments: Proteins were screened to filter out redundant proteins and those that contain repetitive sequences. The protein-to-genome alignment program, ProSplign, was improved to reduce the number of short or fragmented alignments.
Predicting splice variants: Gnomon was modified to improve calling the translation start codon to avoid fusing neighboring but distinct coding regions together. In addition, modifications were implemented to prevent predicting splice variant models if the model is calculated to be subject to nonsense-mediated decay (NMD), includes an internal stop codon, or if the protein appears to be frameshifted. Alternatively spliced predictions that appear to represent a partial CDS or appear to be completely contained within another longer model prediction are discarded.
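The rejection rules for splice variant models described above amount to a simple predicate over each candidate. The following is a minimal sketch only, not Gnomon's implementation: the `VariantModel` class, its field names, and `keep_variant` are hypothetical names introduced for illustration, and how each flag would actually be computed is outside its scope.

```python
from dataclasses import dataclass


@dataclass
class VariantModel:
    """Hypothetical flags describing one candidate splice-variant model."""
    nmd_candidate: bool = False        # calculated to be subject to NMD
    internal_stop: bool = False        # includes an internal stop codon
    frameshifted: bool = False         # protein appears to be frameshifted
    partial_cds: bool = False          # appears to represent a partial CDS
    contained_in_longer: bool = False  # completely contained within a longer model


def keep_variant(model: VariantModel) -> bool:
    """Return True only if none of the rejection criteria apply."""
    return not (
        model.nmd_candidate
        or model.internal_stop
        or model.frameshifted
        or model.partial_cds
        or model.contained_in_longer
    )


print(keep_variant(VariantModel()))                    # model with no flags set
print(keep_variant(VariantModel(internal_stop=True)))  # model with an internal stop
```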
Map Viewer
FTP: [README]
A concatenated FASTA file is now provided for non-transcribed pseudogene annotations, and for immunoglobulin and T-cell receptor gene segments.
The org_transcript.gtf.gz and zoo_transcript.gtf.gz files are now provided in GFF format.
microRNA annotation is now included in the FTP files.
Variation map: The variation map calculated from data in dbSNP is available for this build.
CCDS: Gene and RefSeq transcript annotations that are associated with a CCDS ID, and that have not been curated to withdraw or modify the annotation, were propagated to build 36.1 for regions where the underlying assembly component did not change. These have been propagated to build 36.2.
Assembly
This build includes the reference assembly, a whole genome and single chromosome alternate assembly, and partial chromosome alternate haplotypes.
reference assembly: RefSeq records representing the official reference genome assembly include sequences assembled into chromosomes 1 through Y (RefSeq accessions NC_000001 through NC_000024), the human mitochondrion, and RefSeqs representing sequences that were not localized on a chromosome. The reference sequence assembly is based on genomic sequence data available in August 2005.
Accession Chromosome Sequence change
NC_000001.9 1 yes
NC_000002.10 2 yes
NC_000003.10 3 yes
NC_000004.10 4 yes
NC_000005.8 5 no
NC_000006.10 6 yes
NC_000007.12 7 yes
NC_000008.9 8 no
NC_000009.10 9 yes
NC_000010.9 10 yes
NC_000011.8 11 no
NC_000012.10 12 yes
NC_000013.9 13 no
NC_000014.7 14 no
NC_000015.8 15 no
NC_000016.8 16 no
NC_000017.9 17 no
NC_000018.8 18 no
NC_000019.8 19 no
NC_000020.9 20 no
NC_000021.7 21 no
NC_000022.9 22 yes
NC_000023.9 X yes
NC_000024.8 Y yes
NC_001807.4 mitochondrion no
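The accession.version identifiers in the table above can be split programmatically, for example to list the chromosomes whose sequence changed. This is a minimal sketch that assumes the three whitespace-separated columns shown (accession, chromosome, sequence change); `parse_row` is a hypothetical helper, and only a few rows from the table are copied in as sample input.

```python
# Sample rows copied from the table above:
# accession.version, chromosome, sequence change.
ROWS = """\
NC_000001.9 1 yes
NC_000005.8 5 no
NC_000023.9 X yes
NC_001807.4 mitochondrion no
"""


def parse_row(line):
    """Split one table row into (accession, version, chromosome, changed)."""
    acc_ver, chrom, change = line.split()
    accession, version = acc_ver.split(".")
    return accession, int(version), chrom, change == "yes"


records = [parse_row(line) for line in ROWS.splitlines()]

# Chromosomes whose sequence is marked as changed in the sample rows.
changed = [chrom for _, _, chrom, was_changed in records if was_changed]
print(changed)
```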
Celera assembly: The November 2001 combined whole genome shotgun assembly for chromosomes 1-Y; this assembly includes both WGS and BAC sequence data and does not include gap insertions to represent telomeric, centromeric, or heterochromatin regions. In contrast, build 35.1 included the Celera WGS-only assembly. The Celera assembly annotation shown was computed using NCBI's genome annotation pipeline.
Accession Chromosome
AC_000044.1 1
AC_000045.1 2
AC_000046.1 3
AC_000047.1 4
AC_000048.1 5
AC_000049.1 6
AC_000050.1 7
AC_000051.1 8
AC_000052.1 9
AC_000053.1 10
AC_000054.1 11
AC_000055.1 12
AC_000056.1 13
AC_000057.1 14
AC_000058.1 15
AC_000059.1 16
AC_000060.1 17
AC_000061.1 18
AC_000062.1 19
AC_000063.1 20
AC_000064.1 21
AC_000065.1 22
AC_000066.1 X
AC_000067.1 Y
CRA_TCAGchr7v2 chromosome 7 assembly: chromosome 7 assembly and annotation based on data provided by The Chromosome 7 Annotation Project in December, 2005.
Accession Chromosome
AC_000068.1 7
DR53 haplotype: An alternate haplotype of the MHC region; annotation is provided in a curated RefSeq genomic record (NG_002433).
c22_H2 haplotype: An alternate chromosome 22 assembly that contains the CYP2D6 gene (NT_113959.1). CYP2D6 is deleted in the reference assembly.
c5_H2 haplotype: A chromosome 5 alternate assembly of the SMN1 gene region (NT_113801.1, NT_113802.1).
c6_COX haplotype: An A1-B8-DR3 alternate haplotype assembly of the chromosome 6 MHC region based on sequence data from the COX library (NT_113891.1). See the Wellcome Trust Sanger Institute MHC Haplotype Project web site for additional information.
c6_QBL haplotype: An A26-B18-DR3 alternate haplotype assembly of the chromosome 6 MHC region based on sequence data from the QBL library (NT_113892.1, NT_113893.1, NT_113894.1, NT_113895.1, NT_113896.1, NT_113897.1). See the Wellcome Trust Sanger Institute MHC Haplotype Project web site for additional information.
Annotation
Masking: The NCBI program WindowMasker was used to mask repetitive genomic sequences before making BLAST hits. In previous builds RepeatMasker was used.
Contamination Screening: Screening for vector contamination, insertion elements, and other foreign sequences has improved since the previous build.
Transcript and Protein Alignments: Splign is used to compute cDNA-to-genomic alignments that account for introns and splice signals. An earlier version of Splign was used in build 35.1. A new program, ProSplign, has been developed at NCBI to generate optimized protein-to-genomic alignments that account for splice site location. ProSplign alignments are now used in the prediction of new gene and pseudogene models. The proteins used consisted of the human RefSeq protein collection (accession prefix NP_) plus proteins annotated on human cDNAs available in GenBank.
Predicting splice variants: Gnomon, NCBI's computational gene annotation program, now predicts splice variants based on both transcript and protein alignments. Splice variant predictions are available in the Ab initio track of Map Viewer and a subset that are supported by transcript or protein alignments are instantiated as new models (with XM_ or XR_ accessions) for genes that are not represented by a curated RefSeq (with accession prefix NR_, NM_, or NG_).
More genes predicted: More genes were annotated in this build because multi-exon Gnomon predictions that overlap with RefSeq-based annotations were treated as different genes if the CDS does not overlap. In addition, Gnomon predictions of genes located within introns of another gene are also preserved. In previous builds, predictions that overlapped with, or were contained within, RefSeq-based annotations were discarded.
Use of curation hints: As part of the RefSeq curation that occurs between builds, some genes are identified that are known to NOT be included in the reference assembly. These curatorial hints are now used to prevent incorrect annotation of a gene on the reference assembly based on RefSeq alignment to the chromosome location at which the gene may be found in another haplotype. Should the RefSeq also align elsewhere in the genome, that alignment information may still be considered in the context of predicting a new related gene or pseudogene.
Use of multiple RefSeq alignments: In previous builds, when a RefSeq mRNA aligned to more than one location at high quality, only one best placement was annotated. In build 36, the RefSeq and gene are annotated at the best alignment location and additional alignments are considered in the context of predicting new related genes or pseudogenes. New models predicted based on RefSeq mRNA alignments are assigned a model accession with the prefix XM_ or XR_.
Alternate Assemblies: All assemblies were annotated using the same process flow and parameters. Annotation of the Celera and other alternate assemblies changed in this build to include processing of more transcript and protein alignments and consideration of alignment data in the context of ab initio prediction results.
Feature annotation:
Variation: Calculation of variation annotation is no longer synchronized with the whole genome annotation pipeline, and variation is not annotated on the sequence records provided for FTP. Variation data is available from the dbSNP FTP site (ftp://ftp.ncbi.nih.gov/snp/).
DEFLINE: Transcript and protein RefSeq records that are predicted by the genome annotation pipeline have a DEFLINE that begins with "PREDICTED". This provides an additional distinction between the products that are calculated by the NCBI computational annotation process flow (XM_, XR_, XP_ accession prefix) and those that are based directly on GenBank submissions (NM_, NR_, NG_, NP_ accession prefix).
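The prefix distinction described for the DEFLINE can be expressed as a small lookup. This is a minimal sketch: the function names and category labels are hypothetical, and the placeholder accession XM_000000 simply follows the accession format used elsewhere in these notes rather than naming a real record.

```python
# Prefixes for products calculated by the computational annotation process flow.
MODEL_PREFIXES = ("XM_", "XR_", "XP_")
# Prefixes for records based directly on GenBank submissions.
SUBMISSION_PREFIXES = ("NM_", "NR_", "NG_", "NP_")


def refseq_category(accession):
    """Classify a RefSeq accession by its prefix (labels are illustrative)."""
    if accession.startswith(MODEL_PREFIXES):
        return "predicted"
    if accession.startswith(SUBMISSION_PREFIXES):
        return "submission-based"
    return "other"


def defline_is_predicted(defline):
    """True if the definition line begins with the word PREDICTED."""
    return defline.startswith("PREDICTED")


print(refseq_category("XM_000000"))  # placeholder accession
print(refseq_category("NG_002433"))  # curated genomic record cited in these notes
```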
Map Viewer
New Feature: The Map Viewer has been expanded to support accessing a previous build. This new feature is available for the first time with the release of human build 36.1. NCBI build 35.1 can still be accessed from the Map Viewer home page list of organisms. In addition, it is possible to switch the display between the current and previous builds from the human genome overview page and from the chromosome display pages. Links to the genome overview page for both build 35.1 and build 36.1 are now provided in the left column of chromosome display pages. Displays of build 35.1 include prominent red text "Build 35.1 (Previous)" at the top of the page and in the blue column to indicate that an older dataset is being displayed.
Scaling of Celera chromosomes: The reference assembly chromosomes include large gaps to represent the telomeres, centromeres and heterochromatin regions. Celera did not insert these gaps in their chromosome assemblies, so the nominal length of the Celera chromosomes is much shorter than the length of the reference chromosomes. As a result, the Celera maps are scaled differently and do not have the large gaps with no contigs or genes that are seen for the reference chromosomes.
Variation map: The Variation map was not available at the time of build 36.1 release. It will be added at a later time.
Version 1
Assembly
Reference sequences for two alternate haplotypes in the Major Histocompatibility Complex (DR52 NG_002392 and DR53 NG_002433) are included in this build. (The Major Histocompatibility Complex in the reference assembly matches the DR51 haplotype.)
Access to another alternate assembly of the human genome has been added with this build, namely the December 2001 whole genome shotgun assembly (WGSA) generated by the Celera Assembler applied to shotgun data only: the 27 million reads of Celera's 5.3X whole genome shotgun data and 104,000 BAC end sequence pairs from GenBank from other human genome projects.
This build includes version 1 (March 2003) of the assembly of chromosome 7 from The Center for Applied Genomics, TCAG, termed HSC_TCAG. Version 2 of this assembly (April 2004) became available while NCBI build 35.1 was in progress.
Annotation
Annotation of variation (SNPs):
Placement of variation on the genome became available for this annotation build (35.1) on November 8, 2004. These annotations of variation correspond to the release of dbSNP Build 123.
First use of PREDICTED in annotation of human.
The word PREDICTED has been added to the title of RNA records with accessions beginning with XM and XR and to the title of protein accessions beginning with XP. PREDICTED therefore appears in the definition line seen in retrievals from Entrez nucleotide and protein and in BLAST results.
PREDICTED means that these sequences are derived from genomic placements and not directly from a cDNA. It does not mean that the gene itself is predicted, although it may be.
Partial annotation of the Celera assembly
Genes were placed on the Celera genome assembly only by alignment of RefSeq RNAs. GenBank mRNAs and ESTs were not aligned to genomic contigs from Celera, nor was Gnomon run on the Celera assembly.
Map Viewer
The method of converting sequence to cytogenetic bands was changed in Build 35.1. Thus the content of the ideogram file on the ftp site has changed slightly. The conversions were provided by Terry Furey: Furey TR, Haussler D. Integration of the cytogenetic map with the draft human genome sequence. Hum Mol Genet. 2003 May 1;12(9):1037-44.
Maps added as part of this release include:
Repeats: alignment of common repetitive elements
The May 2002 version of RepeatMasker was executed using these flags (the -w flag invokes MaskerAid):
-w -no_is -cutoff 255 -frag 20000
The placement ids from RepeatMasker were retained to facilitate individual integration events.
GgaUniG: alignment of chicken mRNAs, labeled according to the UniGene cluster to which they belong.
Gga ESTS: alignment of chicken mRNAs (including ESTs).
Data for the Celera assembly are not available for all the maps produced for the reference assembly.
Version 2
With the release of this version, the link for Gene-specific information was changed from LocusLink to Gene.
Annotation differs from that in Build 34.1 in that:
dbSNP annotation is from a new build.
UniGene clusters were updated.
The alignment algorithm for placing mRNAs was modified to improve the use of canonical splice junctions for exons and to allow more overlapping gene models.
Version 1
Assembly
In addition to the reference assembly (termed ref in the Maps&Options box), Map Viewer displays a reference sequence for the DR51 haplotype in the Major Histocompatibility Complex (NG_002432) and the assembly of chromosome 7 from The Center for Applied Genomics, TCAG, termed HSC_TCAG.
This Build is the first to include the pseudoautosomal region of the Y chromosome in the assembly, the resultant contigs, and the feature annotation.
Annotation
Versioning of annotation
The version of an annotation set is now displayed on the Map Viewer page. The version is incremented if the data in one or more maps are updated, for example if a new dbSNP build is released and the Variation map is changed accordingly. The statistics page now supports reporting by version. In contrast, a Build is incremented only with a change in the reference sequence (assembly) itself, and the initial version for that new build is set to one (1).
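The build and version numbering described above can be sketched as follows. This is a minimal illustration that assumes labels are written as "build.version" (as in 35.1 or 36.2 elsewhere in these notes); `parse_label` and `next_label` are hypothetical helper names.

```python
def parse_label(label):
    """Split a 'build.version' label such as '36.1' into two integers."""
    build, version = label.split(".")
    return int(build), int(version)


def next_label(label, assembly_changed):
    """Return the label that follows `label`.

    A change in the reference assembly starts a new build at version 1;
    an update to map data alone increments only the version.
    """
    build, version = parse_label(label)
    if assembly_changed:
        return f"{build + 1}.1"
    return f"{build}.{version + 1}"


print(next_label("36.1", assembly_changed=False))  # annotation-only update
print(next_label("35.1", assembly_changed=True))   # new assembly
```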
Gene Annotation
The algorithm for placing mRNAs on the genome was improved to:
align small internal exons
generate full-length alignments, extending the alignment to cover short regions at the ends of the transcripts
These changes should be apparent in the UniGene and EST maps as well as the exon annotation on the Reference sequences.
The number of genes annotated on the reference genome has decreased, and the number of models identified as pseudogenes has increased. This is primarily due to a change in the algorithm used to model genes, mRNAs and proteins, which gives more weight to coding propensity and matches to existing proteins, and checks more rigorously for changes in frame. This method, developed by Alexandre Souvorov and named Gnomon, has replaced GenomeScan as our standard method of predicting gene models. It is discussed in more detail here. In this method, any gene model that results in a frameshift or premature termination relative to a set of conserved proteins is flagged as a probable pseudogene. That pseudogene is retained as the annotation unless: (1) the gene model corresponds to the best placement of a RefSeq mRNA from a protein-coding gene, or (2) the gene is identified as protein-coding by best placement of known mRNAs. If an mRNA aligns well to the model, a model RNA product is generated (RefSeq accession of the format XR_xxxxxx); otherwise the gene is annotated as /pseudo with no product. Because of the above, there are now three sources of pseudogene annotation:
Alignment of a genomic RefSeq accession, with the pseudogene annotation transferred by alignment.
Alignment of a RefSeq RNA from a pseudogene (NR_xxxxxx).
Evaluation of protein-coding propensity by Gnomon.
Map Viewer
Release of this build involves addition of several maps and changes in software.
New maps
Hs_EST: alignment of human ESTs and mRNAs not grouped by UniGene clusters
Mm_EST: alignment of mouse ESTs and mRNAs not grouped by UniGene clusters
Rn_EST: alignment of rat ESTs and mRNAs not grouped by UniGene clusters
Ssc_EST: alignment of pig ESTs and mRNAs not grouped by UniGene clusters
Bt_EST: alignment of cow ESTs and mRNAs not grouped by UniGene clusters
HSC_TCAG: RNAs annotated by the TCAG group. Functional only on the HSC_TCAG assembly.
Replacement map
Ab initio: replaces the GenomeScan map for the comprehensive display of gene predictions.
Software changes
The Maps&Options tool was modified to support displays of selected types of annotation on different assemblies in the same view. For more details, please refer to the Map Viewer help documentation.
The reference DNA sequence of Homo sapiens was first made available for downloading here. On April 28-29, 2003, the sequence records were updated with current annotation and Map Viewer was updated to reflect that annotation. No changes were made in data processing between Build 32 and Build 33.
Assembly
The number of 'finished' chromosomes increased. Now chromosomes 6, 7, 9, 10, 13, 14, 18, 19, 20, 21, 22 and Y are considered complete.
Gene annotation
Color assigned to gene models
There is a modification in the use of color to convey information about the level of evidence supporting a gene and any conflicts in that annotation.
In prior builds, model genes would be orange if they exhibited any difference compared to the genome, including any translation or transcription discrepancy of the mRNA/CDS with respect to the referenced cDNA/protein product (such as a single gap). Before Build 31, even mismatches, such as SNPs, resulted in this color difference. In Build 32, the use of the orange color has been made even more limited:
there is less than 85% coverage of the mRNA transcription with respect to the cDNA product,
the percent identity of the transcription unit with respect to the cDNA product is less than 98%, or
there is a gap in the coding region, that is, the translation of the coding region with respect to its referenced protein product is not limited to amino acid substitutions.
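The three Build 32 conditions for coloring a model gene orange can be written as a single test. This is a minimal sketch: the function and parameter names are hypothetical, and coverage and identity are assumed to be supplied as fractions between 0 and 1.

```python
def is_orange(coverage, identity, cds_has_gap):
    """Return True if a model gene meets any of the Build 32 'orange' criteria.

    coverage    -- fraction of the cDNA product covered by the mRNA transcription
    identity    -- fractional identity of the transcription unit to the cDNA product
    cds_has_gap -- True if the coding-region translation differs from the
                   referenced protein by more than amino acid substitutions
    """
    return coverage < 0.85 or identity < 0.98 or cds_has_gap


print(is_orange(coverage=0.90, identity=0.99, cds_has_gap=False))
print(is_orange(coverage=0.80, identity=0.99, cds_has_gap=False))
```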
The file on the ftp site that corresponds to the gene annotation, seq_locus.md, has been modified to represent explicitly the coding and non-coding regions of each exon. Thus the LOCUS lines have been replaced with CDS and UTR, respectively.
Map Viewer
Phenotype map
A new map was added in this build, namely a representation of phenotypes from OMIM in sequence coordinates. This map is called Phenotype in the Maps&Options selection, and is labeled as Pheno on the query results page. Thus it is now easier, when querying by a disease name, to know whether it has been placed on a sequence map at all.
If the phenotype is associated with a known gene, its sequence coordinates correspond to those of the gene. If the phenotype is placed by linkage or association to mapped markers, the phenotype is placed by the position of that marker or markers. At present, there is no step to extend the range defined by the markers to reflect the level of confidence in any boundary marker.
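The placement rule for phenotypes can be sketched as below. This is an illustration under stated assumptions, not the actual Map Viewer logic: `phenotype_range` is a hypothetical function, positions are assumed to be (start, end) pairs in sequence coordinates, and taking the outermost marker coordinates as the phenotype range is one reading of "the range defined by the markers".

```python
def phenotype_range(gene_range=None, marker_ranges=()):
    """Return the (start, end) placement of a phenotype, or None if unplaced.

    gene_range    -- (start, end) of the associated known gene, if any
    marker_ranges -- (start, end) pairs of the mapped markers, if any
    """
    if gene_range is not None:
        # Phenotype associated with a known gene: use the gene's coordinates.
        return gene_range
    if marker_ranges:
        # Placed by markers: span the markers with no confidence-based extension.
        starts = [start for start, _ in marker_ranges]
        ends = [end for _, end in marker_ranges]
        return (min(starts), max(ends))
    return None


print(phenotype_range(marker_ranges=[(500, 520), (300, 310)]))
```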
Representation of coding regions
Coding regions are now represented differently from non-coding on both the Gene and Transcript (RNA) maps. On the Gene map, this representation is the summary of all the coding regions (CDS) and untranslated regions (UTR) for each transcript, if several are annotated. Thus it is possible to have a Gene look as if it has UTR interspersed with CDS, as in the case where there is a shorter variant with a 3' UTR that does not include all exons of any longer variant. In those cases, it might help to add the RNA map to the display.
Assembly
The number of 'finished' chromosomes increased. Now chromosomes 6, 7, 13, 20, 21, 22 and Y are considered complete.
Gene annotation
In this build, there is a significant reduction in the number of genes (from 34,539 to 26,846) annotated on the NT_000000 accessions and displayed on the Gene/Sequence map. This reduction resulted from:
An increased number of longer mRNAs supporting the connection of models that were separate in previous builds.
A reduction in the number of ab initio predictions retained in the annotation. These predictions are still displayed in Map Viewer (GScan map) and are available for ftp and BLAST retrieval.
Map Viewer
Several maps were added in this build:
Placement of mRNAs, including ESTs, from cow and pig on the human map:
Bt_UniGene: cow (Bos taurus)
Ss_UniGene: pig (Sus scrofa)
Fosmid map
Assembly
The number of 'finished' chromosomes increased. Now chromosomes 6, 7, 20, 21, 22 and Y are considered complete.
The number of 'contigs' decreased from 2042 to 1395, based not only on more finished sequence, but also on the decision to retain 'contigs' composed of single BACs in the reference genome only if they contained a gene not found elsewhere in the assembly.
Gene annotation
In this build, there is a significant reduction in the number of genes annotated on the NT_000000 accessions and displayed on the Gene/Sequence map. This reduction resulted from:
Discontinuing the annotation of a predicted gene placement when the mRNA alignments were better elsewhere in the genome and the number of exons predicted at the less optimal location was less than 2/3 of the number at the optimal location.
Preventing annotation of overlapping genes unless the evidence for both was based on the best placement of RefSeq alignments.
Although we recognize that this may result in the removal of some valid models, we hope that the models from the ab initio predictions (GenomeScan) will fill any gaps.
Another change was to retain more RefSeq mRNA accessions in the annotation, based on the best placement in the reference genome or any haplotype.
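The exon-count rule described above for secondary placements can be sketched as a small check. This is a minimal illustration: the function name is hypothetical, and it models only the 2/3 exon-count test, not how alignments are ranked as better or worse.

```python
def keep_secondary_placement(exons_secondary, exons_best):
    """Return True if a less optimal placement passes the 2/3 exon-count test.

    The placement is dropped when its exon count is less than 2/3 of the
    count at the optimal location. Integer arithmetic avoids rounding issues:
    exons_secondary / exons_best >= 2/3  <=>  3 * exons_secondary >= 2 * exons_best
    """
    return 3 * exons_secondary >= 2 * exons_best


print(keep_secondary_placement(6, 9))  # exactly 2/3 of the exons
print(keep_secondary_placement(5, 9))  # fewer than 2/3 of the exons
```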
Map Viewer
The deCODE genetic map was added between the release of Builds 29 and 30.
Assembly
Several significant changes were made in the assembly process in this build:
The curated tiling path of BAC clones (TPF) is now being used as a map.
Definition of overlap was relaxed. Previously, for two draft accessions to align, there had to be at least one fragment pair aligning for at least 2500 bp at 99%. This was changed to 500 bp at 98%.
Stringency of identity was relaxed with the goal of removing redundancy and decreasing warping.
Short fragments from draft sequence contained in another fragment of the same clone were not included.
Order and orientation information (ONO) within phase 1 sequences was determined by use of overlaps, transcripts, and plasmids. Doing ONO within a BAC has the advantage over ONO when that BAC is melded with other BACs in a contig.
Priority was given to ONO order for phase 2 and parts of phase 1 BACs as determined above, at the cost of possibly breaking some sequence overlap. Earlier, overlaps could break any ONO.
ONO information was used to look for lower quality hits in cases such as X -> Y -> Z with X and Z hitting W but not Y.
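The change in the overlap definition can be illustrated by comparing the old and new thresholds on the same fragment pair. This is a minimal sketch: the names are hypothetical, and identity is assumed to be given as a fraction.

```python
# (minimum aligned length in bp, minimum fractional identity)
OLD_CRITERIA = (2500, 0.99)
NEW_CRITERIA = (500, 0.98)


def fragments_overlap(aligned_bp, identity, min_bp, min_identity):
    """True if a fragment pair alignment meets the given overlap criteria."""
    return aligned_bp >= min_bp and identity >= min_identity


# A fragment pair aligning over 800 bp at 98.5% identity.
print(fragments_overlap(800, 0.985, *OLD_CRITERIA))
print(fragments_overlap(800, 0.985, *NEW_CRITERIA))
```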
Gene annotation
The following modifications were made to the gene annotation process.
More weight was given to determining the orientation of multi-exon ESTs by the splice junction sequences rather than the annotation. ESTs with multi-exon alignments having all splices as CT-AC (the reverse complement of the consensus), and without contradictory alignments, were (re-)oriented and the model adjusted to the correct strand.
Splice sites were adjusted to the nearest GT-AG consensus, when present (within +/- 1 bp of splices predicted purely via alignment). Exons of RefSeqs aligned perfectly were not adjusted.
The definition of gene boundaries was altered to allow clustering of model mRNAs into the same gene if one model mRNA was contained completely within another predicted exon, or if a model mRNA overlapped another predicted terminal exon.
EST evidence supporting GenomeScan predictions was included.
Model genes were kept distinct if the sole evidence for a join was a GenomeScan prediction.
A more systematic approach to annotating pseudogenes was initiated. Provisional genomic RefSeq records (accession format NG_000000) were created for a small number of pseudogenes and positioned on the genome by nucleotide sequence alignment.
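The statement that CT-AC is the reverse complement of the GT-AG consensus can be checked directly. In this minimal sketch the intron sequence is made up for illustration, and `intron_strand` is a hypothetical helper that infers strand only from the two terminal dinucleotides.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def intron_strand(intron):
    """Infer strand from intron ends: GT..AG is '+', CT..AC is '-'."""
    ends = (intron[:2], intron[-2:])
    if ends == ("GT", "AG"):
        return "+"
    if ends == ("CT", "AC"):
        return "-"
    return None  # non-consensus ends: no call


intron = "GTAAGTCCCAG"  # made-up intron with consensus GT..AG ends
flipped = reverse_complement(intron)
print(flipped, intron_strand(intron), intron_strand(flipped))
```

An EST whose introns all read CT..AC on the reported strand would therefore be re-oriented to the opposite strand, where the same introns read GT..AG.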
Map Viewer
Representation of expression data has been enhanced by the addition of the SAGE tag map. Please note that this map may not always be available at the same time as the bulk of the data release, in which case it will be added later.
Assembly
There were no modifications to the assembly process relative to the previous build.
Gene annotation
This build identifies more genes than the previous because of the following modifications:
The minimum size of a predicted protein was reduced from 100 aa to 90 aa.
Genes predicted by GenomeScan are retained in cases where models failed any other criterion.
Gene models are allowed to overlap by up to 100 nucleotides.
GenomeScan models are not instantiated as XM_000000 accessions if they overlap, on any strand and in excess of 100 bp, an alignment-based model such that the alignment-based model satisfied preliminary criteria, including ORF length and repeat masking.
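The size and overlap thresholds above can be combined into a rough acceptance test for a GenomeScan model. This is a sketch under stated assumptions: the function names are hypothetical, coordinates are assumed to be 1-based inclusive (start, end) pairs, strand is ignored because the rule applies on any strand, the 90 aa minimum is assumed to apply to GenomeScan models, and the alignment-based models passed in are assumed to have already satisfied the preliminary criteria.

```python
MIN_PROTEIN_AA = 90   # minimum predicted protein size in this build
MAX_OVERLAP_BP = 100  # overlap beyond this blocks instantiation


def overlap_bp(a_start, a_end, b_start, b_end):
    """Length of the overlap between two inclusive intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def instantiate_genomescan(protein_aa, model, alignment_models):
    """Decide whether a GenomeScan model would receive an XM_ accession.

    model            -- (start, end) of the GenomeScan model
    alignment_models -- (start, end) pairs of qualifying alignment-based models
    """
    if protein_aa < MIN_PROTEIN_AA:
        return False
    return all(
        overlap_bp(*model, *other) <= MAX_OVERLAP_BP for other in alignment_models
    )


print(instantiate_genomescan(120, (1, 200), [(101, 300)]))  # overlap of 100 bp
print(instantiate_genomescan(120, (1, 200), [(100, 300)]))  # overlap of 101 bp
```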
Map Viewer
Although there were no changes in the maps provided and the methods of computing them, the ModelMaker tool was added and changes were made to the look and feel of access to zoom functions and tools supporting configuring the display.
ModelMaker
ModelMaker, accessed by the mm link in the legend of the Gene_seq map, allows the user to view aligned mRNAs, ESTs, and GenomeScan models in a strand specific way. Information about each exon, its translation in all reading frames, and its putative splice junctions are provided to enable the user to evaluate evidence for determining all valid combinations of exons and reading frames, test the open reading frames determined by the combination selected, and produce a final 'mRNA' to copy and use in subsequent research.
Configuration
Maps&Options has replaced the previous Display Settings link. It has been made more obvious by a new contrasting background color, and by being accessible not only within the blue bar at the top of the screen but also from the blue column at the left.
Other changes in the basic display include:
Adding an option to set the format of the compact (thumbnail) view of the chromosome in the blue column at the left.
Clarifying the zoom options in the left column by reducing the number of choices and providing a mouse-over to indicate the fraction of the chromosome to display (1/10000, 1/1000, 1/100, 1/10 or 1/1).
Within the Maps&Options box itself, allowing configuration of the thumbnail view.
Displaying the name of the master map in red.
Adding a link to the BLAST search page.
Assembly
There were no modifications to the assembly process relative to the previous build.
Gene annotation
There were major modifications in annotating genes and mRNAs. In previous builds, genes were annotated based only on mRNA alignments, and alignments were not extended based on EST evidence. The Map Viewer was used to indicate other potential gene locations based on GenomeScan predictions and/or EST alignments. In this build, however, a more comprehensive set of genes was annotated on the contig sequences (and thus viewable on the Genes_sequence map) based on the combination of mRNA alignments, EST alignments, and GenomeScan predictions. In particular:
RefSeq, mRNA, and EST alignments were used to identify the most prevalent and strongly supported splice sites and exon boundaries.
Alignments were considered for ESTs when an alignment was >50% of that EST's length.
Predicted (GenomeScan) models that overlapped any mRNA alignment-based model were removed. Thus to see the full set of predictions from GenomeScan, the GenomeScan map should be displayed.
Another major change was to reduce the amount of redundancy in the number of mRNA models selected to represent each gene. In the past, there was no restriction on the number, as long as each mRNA model differed in intron/exon content. In Build 27, multiple models were retained only if they were supported by RefSeq mRNAs and those RefSeq mRNAs matched the assembly at that region quite well. Multiple models per gene are, in fact, provided as RefSeq NM_000000 accessions. That is because another major change in this build was to discontinue providing model mRNAs as XM_000000 accessions when they matched quite well to existing NM_000000 accessions. Thus, when a RefSeq mRNA sequence (NM_000000) was determined to align to the genome with fewer than 3 gaps and fewer than 4 mismatches, that sequence was retained to represent the mRNA model, and no XM_000000 accession was retained or generated.
Thus gene models in Build 27 have been categorized as:
having mRNA and EST evidence
having mRNA evidence only
having EST evidence only
being predicted, but the prediction is overlapped by some ESTs
being predicted, but no EST alignment
These categories of genes are represented by various colors and evidence abbreviations on the Gene_Sequence map.
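Two numeric thresholds described above for Build 27, the EST alignment length cutoff and the gap and mismatch limits for retaining an NM_000000 accession as the model, can be sketched as simple checks. The function names are hypothetical and the sketch covers only these thresholds, not the alignment procedure itself.

```python
def est_alignment_considered(aligned_len, est_len):
    """True if the alignment covers more than 50% of the EST's length."""
    return 2 * aligned_len > est_len


def keep_refseq_as_model(gaps, mismatches):
    """True if a RefSeq mRNA (NM_000000) aligns well enough to represent
    the mRNA model itself: fewer than 3 gaps and fewer than 4 mismatches."""
    return gaps < 3 and mismatches < 4


print(est_alignment_considered(aligned_len=260, est_len=500))
print(keep_refseq_as_model(gaps=2, mismatches=3))
```

When `keep_refseq_as_model` holds, the notes indicate that no XM_000000 accession is retained or generated for that model.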
Map Viewer
Query: It is now possible to retrieve data based on mouse UniGene cluster names, mouse mRNA accessions, and a larger set of human genomic GenBank accessions, because of the addition of the new GenBank DNA and Mouse UniGene maps (see the next section). These additions now permit a display of the position of these sequences relative to this build without having to do a BLAST query.
Maps: Two new maps are now available:
Display Settings | Label | URL | Scope
UniGene_Mouse | UniG_Mm | est_mm | Mouse mRNAs and ESTs, labeled according to the UniGene cluster in which they are found. Query by mouse mRNA accessions and UniGene cluster names.
GenBank DNA | gbdna | gbdna | Human genomic sequences, not used in the assembly process, were aligned to the components of the contigs. An accession is displayed if it shows at least 97% identity to that component, for at least 98 base pairs. It is not displayed if it has a different chromosome assignment. If it extends beyond a contig, the unaligned portion of the sequence is not shown. Thus some alignments relative to the contig may be very short. You can use the 'hits' link to see a tabular display of what is matched by any such sequence. In the line indicating the position of the alignment, the segment corresponding to the portion of the GenBank sequence that actually aligns is wider.
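The display criteria for the GenBank DNA map can be summarized as a single test. This is a minimal sketch: the function and parameter names are hypothetical, identity is assumed to be a fraction, and every accession is assumed to carry a chromosome assignment (the notes do not say how unassigned sequences are handled).

```python
def show_on_gbdna(identity, aligned_bp, accession_chrom, contig_chrom):
    """True if an accession would be displayed on the GenBank DNA map.

    Requires at least 97% identity over at least 98 base pairs, and the
    same chromosome assignment as the contig component it aligns to.
    """
    return (
        identity >= 0.97
        and aligned_bp >= 98
        and accession_chrom == contig_chrom
    )


print(show_on_gbdna(0.975, 150, "7", "7"))
print(show_on_gbdna(0.975, 150, "7", "X"))
```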
There are also modifications to existing maps:
Genes_seq
Additional color coding was introduced to make it easier to determine either the type of evidence for a model or the level of confidence in that model.
Components
This map was previously named GenBank. The name was changed (1) to make it clear that the accessions seen on this map were actually used in the assembly process and (2) to make the name more distinct from the new GenBank DNA map.
Contig
The contigs are now color-coded to indicate which regions are from draft sequence (orange) and which are from finished sequence (blue).
New features
seq link
The link to the download sequence form to retrieve a region of genomic sequence has been made more evident for genes annotated on the NCBI contigs. When Genes_sequence or Genes_cytogenetic is the master, the full (verbose) label display includes a seq link to a form that displays the sequence in two coordinate systems: chromosome and the scaffold/contig represented by the NT_000000 accession (and preset to the coordinates of the gene). This makes it easier to download the genomic sequence including a gene of interest.
Evidence Viewer
The display has been modified to provide an indicator of the density of ESTs along an alignment.
Gene annotation
This build is the first to annotate and provide accessions (format XR_000000) for genes that are transcribed, but do not appear to encode a protein.
Map Viewer
The Gene_Sequence (Genes_seq) map now uses color coding to indicate when there is conflicting data about a gene, or when the alignment of a defining mRNA is not perfect. Genes that have such conflicts are represented in orange; those with consistent information are blue. Cases that cause a gene to be shown in orange include:
Gaps in the alignment of the mRNAs used as evidence, or a region between two aligned regions that cannot itself be aligned.
A gene annotated based on mRNA(s) that align in more than one location in the genome. These models may therefore represent unknown members of a gene family, or indicate an assembly error in which sequences could not be merged because of insufficient identity.
A partial set of connections between STS markers and OMIM phenotypes is now included. More details about how to search for and display these connections are provided in the help documentation for human Map Viewer.
Assembly
Before assembly, source sequences were divided into sub-chromosomal bins based on genetic and RH mapping data. This was done to prevent false joins that might occur when distant regions within a chromosome have very similar sequence.
Gene annotation
Genes continued to be identified based on alignment of mRNAs (RefSeq and GenBank) but not ESTs. Gene boundaries are based on those aligned models that
share an exon, or
share an intron, or
if an alignment has one exon, one is a subset of the other.
A modification with this build relaxed the definition of 'shared exon' to:
ends within 10 bp of each other AND at least 5 bp, AND at least 50% of the exon's length in common.
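The relaxed 'shared exon' test can be sketched as follows. The notes leave some details open, so this block makes explicit assumptions: coordinates are 1-based and inclusive, "ends within 10 bp" is applied to both the start and the end of the exons, and the 50% requirement is checked against the longer of the two exons. The `Exon` type and `shares_exon` function are hypothetical names.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    """Hypothetical exon interval, 1-based inclusive coordinates."""
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def shares_exon(a: Exon, b: Exon) -> bool:
    """Relaxed shared-exon test, under the assumptions stated above."""
    # Assumption: both ends must lie within 10 bp of each other.
    ends_close = abs(a.start - b.start) <= 10 and abs(a.end - b.end) <= 10
    # Number of bases the two exons have in common (0 if disjoint).
    overlap = max(0, min(a.end, b.end) - max(a.start, b.start) + 1)
    # Assumption: 50% is measured against the longer exon.
    enough_in_common = overlap >= 5 and overlap >= 0.5 * max(a.length, b.length)
    return ends_close and enough_in_common
```

With these assumptions, the overlap conditions mainly matter for very short exons, where two intervals can have nearby ends yet share only a few bases.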
Identification of the 5' and 3' extents of an alignment became more rigorous. Limits were set on the size of the first and last introns.
The number of model mRNAs (accessions of the form XM_######) and proteins (XP_######) decreased slightly because of the relaxed definition of intron/exon identity described above. Model mRNAs were not discarded, however, if they were based on the alignment of a RefSeq.
Protein matches to GenomeScan model proteins are now made to all of the nr database.
Map Viewer
Maps: A new map is available: Transcript map. This map displays the position of the XM_###### accessions in chromosome coordinates. Links are provided to the sequence record, the LocusLink record, and the sequence and evidence/alignment viewers.
Graphical Display: For the transcript and gene (sequence) maps, orientation of alignment is now indicated by arrows instead of + or -. Also, the strand of the alignment is made clearer by displaying the map object on separate sides of the vertical grey line.
Tabular display: More options are now available for controlling the tabular display and downloading the map data. An option to use the evidence viewer has been added within the download sequence function and when either the 'Gene on Sequence' or Transcript map is being reported.
Assembly
Genomic accessions used in assembly were assigned to chromosomes according to information from chromosome coordinators. If the assignment for a sequence differed from that suggested by mapped STS, the sequence was assembled with the chromosome indicated by overlaps with other sequences.
For chromosome 20, the assembly provided by the Sanger Centre was used.
Gene annotation
Models in this release were based on mRNAs (GenBank and RefSeq, but not ESTs). Alignment criteria were:
initial placement based on sequence identity of >=95%
extension based on sequence identity of >=80%
overall alignment of >1000 nucleotides or 50% of the length.
Only the best alignment in the genome was retained. This is in contrast to the previous build in which only RefSeq mRNAs were used.
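The alignment criteria and the "best alignment only" rule can be illustrated with a short filter. This is a sketch under stated assumptions, not the actual annotation code: the `MrnaAlignment` record and both function names are hypothetical, "50% of the length" is read as at least half of the mRNA's length, and "best" is assumed to mean the longest qualifying alignment with extension identity as a tie-breaker, since the notes do not define the ranking.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class MrnaAlignment:
    """Hypothetical record for one placement of an mRNA on the genome."""
    mrna_accession: str
    contig: str
    placement_identity: float  # percent identity of the initial placement
    extension_identity: float  # percent identity of the extended alignment
    aligned_length: int        # aligned nucleotides
    mrna_length: int           # full length of the mRNA


def passes_criteria(aln: MrnaAlignment) -> bool:
    """Check the three stated criteria (>=95%, >=80%, >1000 nt or 50%)."""
    # Assumption: "50% of the length" refers to the mRNA's own length.
    long_enough = (aln.aligned_length > 1000
                   or aln.aligned_length >= 0.5 * aln.mrna_length)
    return (aln.placement_identity >= 95.0
            and aln.extension_identity >= 80.0
            and long_enough)


def best_alignment(alns: Iterable[MrnaAlignment]) -> Optional[MrnaAlignment]:
    """Keep a single qualifying alignment for one mRNA, or None."""
    candidates = [a for a in alns if passes_criteria(a)]
    if not candidates:
        return None
    # Assumption: rank by aligned length, then by extension identity.
    return max(candidates, key=lambda a: (a.aligned_length, a.extension_identity))
```

A different ranking (for example, identity first) would be an equally plausible reading; the point of the sketch is only that each mRNA contributes at most one placement.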
For the first time, RefSeq reference gene accessions (format NG_######) were used to apply more detailed annotation by incorporating reference gene accessions in our assembly and annotation processes. See, for example, NG_000004 and its placement in NT_007812.
Map Viewer
Query: It is now possible to query the genome based on a larger set of GenBank mRNA accessions. Connections by accession to genes are based on the file loc2acc that existed at the time of the release. Connections to the UniGene map are based on the mRNA accessions that aligned best at that position.
Maps: The previous EST map was replaced by a UniGene map. This map displays alignments of mRNAs and ESTs, labels those alignments based on the dominant UniGene cluster(s) to which the accessions belong, and provides a link to the UniGene record.
New features
Evidence Viewer
Evidence Viewer provides these major functions:
a representation of all exons in a gene annotated in a contig (NT_######), with an extension of 1500 nucleotides at both ends
a display of mRNAs aligning to that region
a summary of sequence differences (mismatches/insertions/deletions)
a multiple alignment, with mismatches, insertions, deletions, and coding changes clearly indicated
This display is currently accessed by the ev link provided in the Map Viewer labels for genes, and in the Genome Annotation portion of a LocusLink report. It can also be accessed directly if you know the contig accession and the gene symbol. Example:
http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/evv.cgi?contig=NT_011519&gene=HIRA
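The direct-access URL follows a simple pattern of two query parameters, `contig` and `gene`. A minimal sketch for building such a link is shown below; the helper name `evidence_viewer_url` is hypothetical, and the endpoint is the one quoted in these notes, which may no longer be served.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the release notes.
EVV_BASE_URL = "http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/evv.cgi"


def evidence_viewer_url(contig: str, gene: str) -> str:
    """Build an Evidence Viewer link from a contig accession and gene symbol."""
    query = urlencode({"contig": contig, "gene": gene})
    return f"{EVV_BASE_URL}?{query}"


# Reproduces the example from the notes.
print(evidence_viewer_url("NT_011519", "HIRA"))
```

Using `urlencode` rather than string concatenation keeps the link valid if a gene symbol ever contains characters that need escaping.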