Variant Tracks

RGD stores and presents several kinds of variant data.  Researchers who have done whole genome sequencing on specific rat strains submit their variant calls for those strains to RGD.  This variant data is loaded into the RGD Variant Visualizer tool and into the GBrowse and JBrowse genome browsers for searching and visualization.  In addition, RGD imports single nucleotide variation (SNV) and simple sequence length polymorphism/microsatellite marker (SSLP) data from NCBI (SNVs and SSLPs) and Ensembl (SNVs).  Finally, researchers doing genetic analyses for QTLs or constructing congenic strains, who have discovered and used SSLPs and SNVs as markers in their studies, often submit these to RGD.  All of these types of variants are represented in the JBrowse variant tracks.

 


Strain Specific Variant Tracks

Most strain specific variant tracks show single nucleotide variants (SNVs), although some JBrowse instances also have tracks for indels (see below).  To view variants in your strain of interest, select "Variants" in the "Available Tracks" list, choose "Strain Specific Variants", then select one or more strains from the list.  Each strain is listed as the symbol of the specific substrain that was sequenced, followed in parentheses by an abbreviation for the group or institution that carried out the sequencing and/or analysis, including the calling of variants relative to the reference sequence of the assembly covered by the version of JBrowse you are viewing (e.g. in this example, strains include "WKY/N (KNAW)" and "ACI/Eur (MCW)").  Note that in some cases different substrains of the same strain have been sequenced, allowing comparisons between them.  In the rat v3.4 JBrowse, you will also find cases where the same substrain has been sequenced by two different groups, showing how differences in the sequencing and in the software used for analysis can result in substantial differences in the variants called in some regions.

RGD strain specific variant popups display the same information in the same format as the variant detail display in the Variant Visualizer tool.  This information includes:

Strain:  The symbol of the specific substrain that was sequenced followed in parentheses by an abbreviation for the group/institution that carried out the sequencing and/or analysis.

Position:  The genomic position of the variant in the format "Chromosome: [number] - [nucleotide position]."

Reference Nucleotide:  The nucleotide at that position in the reference sequence of that assembly.

Variant Nucleotide:  The nucleotide(s) predicted at that position by sequencing of that strain.  If more than one variant allele was found at that location in that strain sequence, all possible variants will be listed, separated by commas.

Location:  "GENIC" if the variant overlaps the position of a gene; "INTERGENIC" if it does not.  Information about the location of a genic variant within a given transcript is found in the transcript section below the header of the popup.

Zygosity:  If available, this gives the estimated zygosity of the variant.  Possible values in this field are:

  • Heterozygous:  2 alleles called between 15% and 85% of reads
  • Homozygous:  Variant read in 100% of reads
  • Possibly Homozygous:  Variants read in 85% to 99% of reads
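
The thresholds above can be read as a simple classification rule.  The following Python sketch is illustrative only: the function name classify_zygosity is hypothetical, and the handling of boundary values (exactly 85%, values between 99% and 100%, and values below 15%) is an assumption, since the list does not say which category they fall into.

```python
from typing import Optional


def classify_zygosity(percent_variant_reads: float) -> Optional[str]:
    """Map a percentage of variant reads to a zygosity label.

    Assumptions (not stated in the help text): exactly 85% is treated as
    heterozygous, anything above 85% and below 100% as possibly homozygous,
    and anything below 15% gets no label.
    """
    if not 0.0 <= percent_variant_reads <= 100.0:
        raise ValueError("percentage must be between 0 and 100")
    if percent_variant_reads == 100.0:
        return "Homozygous"
    if percent_variant_reads > 85.0:
        return "Possibly Homozygous"
    if percent_variant_reads >= 15.0:
        return "Heterozygous"
    return None  # below the lowest threshold listed above
```

For example, under these assumptions a variant seen in 92% of reads would be labelled "Possibly Homozygous".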

Related Variants:  If this variant appears in NCBI's SNP database (dbSNP), the RS ID of the corresponding dbSNP record appears in this field.

Conservation:  Where available, the conservation score of this nucleotide in the genomic sequence, as imported from the UCSC database, is shown here.  The conservation score is a number between 0 and 1; the higher the number, the greater the conservation across species.  If the conservation score is not available in JBrowse, this will be "n/a".

Total Depth:  The read depth at this position from the sequencing data.  A higher value denotes more sequencing reads aligned to that position in the reference genome.

% Variant Reads:  The ratio between the number of sequencing reads at this position that were called as a variant and the number of sequencing reads at this position that were called as matching the reference nucleotide at this position, times 100.

Total Alleles Read:  The number of different nucleotide assignments at that position in that sequence.  For example, if some reads were called as a variant and others called as the reference nucleotide, this will be 2, whereas if all of the reads were called as the variant, this will be 1.

VID:  The variant ID, a unique, arbitrary number assigned to the variant when it was loaded into the RGD genome browsers.
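
As a rough illustration of how the read-based fields relate to each other, the Python sketch below derives them from per-nucleotide read counts at a single position.  The function name summarize_reads and the input layout are hypothetical.  The sketch takes the percentage of variant reads over all reads at the position; that is an assumption which keeps the value between 0 and 100, in line with the ranges used in the Zygosity field, and it may not be exactly how RGD computes the number.

```python
from typing import Dict


def summarize_reads(base_counts: Dict[str, int], reference: str) -> Dict[str, float]:
    """Summarize the read calls at one position.

    base_counts maps each called nucleotide to the number of reads
    supporting it; reference is the reference nucleotide at the position.
    """
    total_depth = sum(base_counts.values())
    if total_depth == 0:
        raise ValueError("no reads at this position")
    # Reads whose call differs from the reference nucleotide.
    variant_reads = sum(n for base, n in base_counts.items() if base != reference)
    # Count only nucleotides that were observed in at least one read.
    total_alleles = sum(1 for n in base_counts.values() if n > 0)
    return {
        "total_depth": total_depth,
        # Assumption: percentage taken over all reads (the total depth).
        "percent_variant_reads": 100.0 * variant_reads / total_depth,
        "total_alleles_read": total_alleles,
    }
```

With invented counts of 6 reads for the reference "A" and 14 reads for "G", this gives a total depth of 20, 70% variant reads and 2 alleles read.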

Transcript information:  For genic variants, information about the transcript(s) they overlap is given, including:

  • Accession:  The RefSeq accession number of the transcript mRNA/cDNA sequence
  • Location:  The location of the variant within the transcript, e.g. INTRON, EXON, 5UTR (i.e. within the 5' UTR), etc.
For variants which fall within protein coding exons, additional information is given:
  • Amino Acid Prediction:  The amino acid change which is predicted to occur because of the variation in the nucleotide sequence.  For synonymous variants the amino acids are the same, e.g. "I to I (synonymous)".  For non-synonymous variants, the same format is followed but the two single-letter amino acid designations will be different, in this case, "V to I (nonsynonymous)".
  • PolyPhen Predictions:  Where available, the PolyPhen prediction of the consequence of the non-synonymous amino acid change is given.  For more about PolyPhen output, see http://genetics.bwh.harvard.edu/pph2/dokuwiki/appendix_a.
  • Amino Acid Sequence:  Since variants are called relative to the genomic reference sequence, RGD calculates the protein sequence that would be produced from that reference genomic sequence, based on the chromosomal position of that transcript's translation start site and the start and stop positions of the protein coding exons.  This amino acid sequence is given with only the amino acid affected by this variant substituted.  The substituted amino acid is shown in bold and red.  In the example here, the substituted amino acid, an "I", is toward the right end of the third line of the sequence.
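
The Amino Acid Prediction and Amino Acid Sequence fields can be pictured with a small Python sketch.  It is not RGD's code: the function names are hypothetical, the five-residue sequence in the usage note is invented, and positions are assumed to be 1-based.

```python
def describe_change(reference_aa: str, variant_aa: str) -> str:
    """Format a change in the style shown in the popup, e.g. 'V to I (nonsynonymous)'."""
    kind = "synonymous" if reference_aa == variant_aa else "nonsynonymous"
    return f"{reference_aa} to {variant_aa} ({kind})"


def apply_substitution(protein: str, position: int, variant_aa: str) -> str:
    """Return the protein with only the residue at `position` replaced.

    `position` is assumed to be 1-based (first amino acid is position 1).
    """
    if not 1 <= position <= len(protein):
        raise ValueError("position is outside the protein sequence")
    index = position - 1
    return protein[:index] + variant_aa + protein[index + 1:]
```

For an invented reference peptide "MKVLA", apply_substitution("MKVLA", 3, "I") returns "MKILA", and describe_change("V", "I") returns "V to I (nonsynonymous)".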

Click here to view this strain specific variant example in JBrowse.


As mentioned in the previous section, since variants are called relative to the genomic reference sequence, the protein sequences that appear in the strain-specific variant popups are informatically translated from that reference sequence (e.g. RGSC 3.4 or RGSC 5.0). The position of the transcript for a gene and its intron/exon structure are used to produce a putative "reference" RNA sequence from which the protein is translated.  In some instances, these derived protein sequences do not match the published protein sequences from RefSeq or UniProt. In such cases, a warning message appears above the protein sequence stating that there could be problems with the transcript.  Note in the case to the left, the translated sequence derived from the genomic reference sequence contains two stop codons (denoted as asterisks in the protein sequence, highlighted by red arrows) in the middle of the putative peptide sequence.
Click here to view this example of a problematic transcript in JBrowse.

Indel Tracks:

The RGSC v3.4 JBrowse includes a number of strain specific variant tracks for small "indels" in addition to tracks for SNVs.  "Indel" here denotes any insertion, deletion or combination thereof which results in a net change in the number of nucleotides in the sequence relative to the corresponding reference sequence.  The variants in these tracks are small, i.e. between one and a few hundred nucleotides, as opposed to structural variations or larger insertions, deletions and rearrangements.  As such, these insertions and deletions are generally short enough that they appear as vertical lines when the JBrowse display is zoomed out to the gene level (or, in many cases, even when zoomed in far enough to view a single entire exon).  If you look closely, you will see that the glyphs are color coded, with deletions depicted in red and insertions shown in blue.
This becomes more obvious if you zoom in to view approximately 200 nucleotides or fewer.  At this level a deletion is depicted as a red bar covering the nucleotides that were deleted.  An insertion is a blue bar one nucleotide long, marking the nucleotide after which the insertion occurs.  In the example to the left, where the zoom level renders about 35 base pairs across the width of the window, the deletion (left) is 7 base pairs long, whereas for the insertion (right) only one nucleotide is marked.

 

The examples to the left are the popups for the deletion and insertion shown in the previous figure.  The popups for indels are different from those for SNVs.  They contain the following information:

Source:  The source of the indel and the information about it, including the strain and the group or institution that did the analysis.

Class:  The type of data, in this case, "INDEL".

Genome Location:  The genomic position of the insertion or the deletion given in the format Chr:Start..Stop.

Feature Length:  "Feature Length" is derived from the listed genomic location, so for deletions, this is the length of the deletion, i.e. the number of base pairs that were removed.  For insertions, it is one base pair regardless of the number of base pairs inserted because the genomic location is the one base pair location on the reference sequence where the insertion occurred.
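
The relationship between Genome Location and Feature Length can be sketched in Python as follows.  This is an illustration under stated assumptions, not RGD's code: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive (so that a location whose start and stop are equal has a length of 1, as described for insertions), and the positions are assumed to be written as plain digits.

```python
import re
from typing import Tuple

# Matches the "Chr:Start..Stop" format, e.g. "Chr1:1000..1006".
_LOCATION = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)\.\.(?P<stop>\d+)$")


def parse_location(location: str) -> Tuple[str, int, int]:
    """Split a 'Chr:Start..Stop' string into (chromosome, start, stop)."""
    match = _LOCATION.match(location.strip())
    if match is None:
        raise ValueError(f"not in Chr:Start..Stop format: {location!r}")
    start, stop = int(match["start"]), int(match["stop"])
    if stop < start:
        raise ValueError("stop position is before start position")
    return match["chrom"], start, stop


def feature_length(location: str) -> int:
    """Length implied by the location, assuming inclusive coordinates."""
    _, start, stop = parse_location(location)
    return stop - start + 1
```

With invented coordinates, a deletion at "Chr1:1000..1006" has a feature length of 7, while an insertion recorded at "Chr1:2500..2500" has a feature length of 1 no matter how many nucleotides were inserted.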

Variant ID:  The variant ID is a combination of the position of the first inserted or deleted base pair plus the indel notation given as the (sequence of the reference allele->the sequence of the alternative allele).  For insertions, the reference allele is empty, i.e. depicted as a dash (-).  For deletions, the alternative allele is shown as a dash.

Variant type:  Whether the variant is an insertion or deletion.

Reference:  The sequence of the reference allele.  For insertions, the reference allele is empty, i.e. depicted as a dash (-).  For deletions, the reference allele is the sequence of nucleotides that were removed.  In the example here, that is the 7 nucleotides "TCTAAAG".

Strain:  The strain that was sequenced.

Filtered depth:  The sequence depth after removal of reads that did not pass the quality control.

Variant call analysis:  Gives the sequence of the variant and the number of reads in which the variant allele was found. For deletions, the variant allele sequence is empty.

Click here to view this indel example in JBrowse.


dbSNP and Ensembl Variant Tracks

RGD imports variant data from both NCBI (dbSNP) and Ensembl to present in our genome browsers for comparison to other data therein.  Variants are generally represented as vertical line glyphs in JBrowse.  Zoom in to view a genomic region of less than approximately 200 bp to see variants as wider bars.

Both dbSNP and Ensembl variants are color-coded according to variant type.  Click here to view a pdf file with more information about the variant types and the colors used to designate them.  (Note that "SO" in this file is the Sequence Ontology.  Click here to search or browse the ontology.)

The ID of the variant appears below the variant glyph.  Click the glyph or the ID to view the popup for more information about that variant.

dbSNP Variant Popup:

Source:  An indicator of where the variant data came from.

Class:  The type of variant, in this case, single nucleotide polymorphism or "SNP".

Genome Location:  The genomic position of the variant in the format "Chr:Start..Stop".

Feature Length:  The length of the variant in base pairs.  In this case, since this is a SNP, the length is 1 bp.

ID:  The RS or "Reference SNP" ID of this variant. For more information about reference SNPs, see http://www.ncbi.nlm.nih.gov/SNP/get_html.cgi?whichHtml=how_to_submit#REFSNP or http://www.ncbi.nlm.nih.gov/books/NBK21088/#ch5.ch5_s4.

Accession:  A unique identifier assigned to the SNP when it was loaded into the RGD genome browsers.

Map Weight:  How many times a variant maps to a given contig sequence.  For example, if the variant maps to only a single location, this will be "unique-in-contig".

Type:  The transcript-level location (such as "intron-variant") of the variant, or the type of coding change resulting from substituting the reference nucleotide with the variant, for example "missense".  Click here to view a pdf file with more information about the variant types and the colors used to designate those.

Allele:  The possible alleles for that variant, listed as "reference_nucleotide/variant_nucleotide".

Ensembl Variant Popup:

Source:  An indicator of where the variant data came from.

Class:  The type of variant, in this case, single nucleotide polymorphism or "SNP".

Genome Location:  The genomic position of the variant in the format "Chr:Start..Stop".

Feature Length:  The length of the variant in base pairs.  In this case, since this is a SNP, the length is 1 bp.

ID:  The unique identifier assigned to that SNP in the incoming data file.  Since SNP data is shared between Ensembl and NCBI, this is often the RS or "Reference SNP" ID of this variant.  For more information about reference SNPs, see http://www.ncbi.nlm.nih.gov/SNP/get_html.cgi?whichHtml=how_to_submit#REFSNP or http://www.ncbi.nlm.nih.gov/books/NBK21088/#ch5.ch5_s4.

Strain(s):  Where available, the symbol and RGD ID of the strain or strains in which the variant was found are listed.

Map Weight:  How many times a variant maps to the corresponding reference sequence.

Type:  The transcript-level location of the variant (such as "intron_variant"), or the type of coding change resulting from substituting the reference nucleotide with the variant, for example "missense_variant".  Click here to view a pdf file with more information about the variant types and the colors used to designate those.

Allele:  The possible alleles for that variant, listed as "reference_nucleotide/variant_nucleotide".
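
When working with the Allele field outside the browser, the "reference_nucleotide/variant_nucleotide" string can be split as in the Python sketch below.  The function name split_alleles is hypothetical, and accepting more than one variant allele after the first slash is an assumption; the popup description above only mentions a single reference/variant pair.

```python
from typing import List, Tuple


def split_alleles(allele_field: str) -> Tuple[str, List[str]]:
    """Split 'reference/variant' into (reference, [variant, ...])."""
    parts = [part.strip() for part in allele_field.split("/")]
    # Require a reference allele and at least one non-empty variant allele.
    if len(parts) < 2 or any(not part for part in parts):
        raise ValueError(f"expected reference/variant, got {allele_field!r}")
    return parts[0], parts[1:]
```

For example, split_alleles("A/G") returns ("A", ["G"]).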

Click here to view this example of dbSNP and Ensembl SNP tracks in JBrowse.

Marker/SSLP Tracks

Before the advent of whole genome sequencing, researchers used a variety of "markers" to localize genomic and genetic elements such as Quantitative Trait Loci (QTLs) in relation to each other, to known genes, and to the chromosome.  Often these markers were Simple Sequence Length Polymorphisms or SSLPs.  An SSLP, or microsatellite, is a simple repeat of a short sequence of 1-6 nucleotides which is polymorphic in length (i.e. in the number of times the sequence is repeated in a specific location) among strains or between individuals and can be used as a genetic marker for genotyping.  Although the rat, mouse and human genomes have now been sequenced and whole genome assemblies are available, markers such as SSLPs still provide an easy way to localize genomic elements.

Each SSLP in RGD is defined by a set of polymerase chain reaction (PCR) primer sequences.  Since PCR is routine in most laboratories, such markers are still used for marking the ends of QTLs and the introgressed DNA regions of congenic strains.

Markers are represented in JBrowse as vertical lines until the zoom level is increased enough for these short DNA segments to be displayed as dark green horizontal bars.  Conversely, when the region displayed is greater than 45 Mb, the display changes from showing individual markers to showing a histogram of marker density across the displayed region.  A label showing the marker symbol is displayed beneath the marker glyph in the format "SSLP:Marker_Symbol".  Click the glyph or the label to view a popup with more information about that marker.

For more about marker records in RGD, go to the RGD marker help pages.

Marker/SSLP Track Popup:

Source:  The source of the SSLP data.  SSLP data is extracted from the RGD database so this is RGD.  For information about the original submitter of the data and/or any publications which reference the discovery of that marker, please go to the RGD Marker report (see below).

Class:  The data type, in this case, SSLPS.

Genome Location:  The genomic position of the variant region in the format "Chr:Start..Stop".  Marker locations are based on the genomic locations where the PCR primers that define the marker "hit", i.e. they are based on ePCR of the primers (for more information about ePCR, see http://www.ncbi.nlm.nih.gov/tools/epcr/).  As such, the location includes the flanking regions surrounding the sequence segment which actually varies in size.

Feature Length:  The genomic size of the region including the variant and its flanking regions, from the genomic position of the most 5' nucleotide of the forward primer to the genomic position of the most 3' nucleotide of the reverse primer.  This would be the predicted size of the PCR product based on the corresponding reference sequence.

Name:  The official symbol of the marker.

RGD Marker Report:  The RGD ID of the marker record.  This links to the RGD marker report for additional information about the marker, including references and any congenic strains and/or QTLs for which it was used.

Expected Size:  The size of the marker (i.e. the size of the PCR product) as originally reported by the research group that characterized it.  Due to differences between genomic assemblies and/or differences between strains, this may or may not be the same as the feature length.
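
The difference between Feature Length and Expected Size can be illustrated with a short Python sketch.  The function names and the coordinates in the usage note are hypothetical, and positions are assumed to be 1-based and inclusive on the reference sequence.

```python
def predicted_product_size(product_start: int, product_end: int) -> int:
    """Feature length of a marker on the reference assembly.

    product_start is the genomic position where the region begins (the
    outer end of the forward primer hit) and product_end is where it ends
    (the outer end of the reverse primer hit), both inclusive.
    """
    if product_end < product_start:
        raise ValueError("end position is before start position")
    return product_end - product_start + 1


def size_difference(feature_length: int, expected_size: int) -> int:
    """Feature length minus the originally reported PCR product size.

    A non-zero result is not necessarily an error: it can reflect
    differences between assemblies or between strains.
    """
    return feature_length - expected_size
```

For a marker whose primers hit at invented positions 5000 and 5149, the feature length is 150 bp; if the originally reported expected size was 146 bp, the difference is 4 bp.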

Click here to view this example of the SSLP variant track in JBrowse.