RNA-Seq Based Tracks

JBrowse marks the introduction of RNA-Seq data into RGD's genome browsers. A complete set of BAM alignments between RNA-Seq reads and the RGSC5.0 reference genome sequence is available for a study of hypoxia in BN and SS rats by the Lazar/Jacob group at MCW. The size of this data set makes it impossible to display in GBrowse, whereas JBrowse is far more scalable and can handle large data sets without slowing down.

The JBrowse for v5.0 also includes a set of gene predictions from the Liu lab at MCW, based on RNA-Seq data with and without the addition of ESTs. These tracks display the gene predictions based on RNA sequencing of bone marrow, brain and kidney, highlighting differences in expression between these three diverse organs.

RNA-Seq BAM Alignment Display

JBrowse BAM alignment tracks display short next-gen RNA sequencing reads aligned to the appropriate reference genome sequence.  Currently, RGD does not perform the alignment itself.  Rather, researchers submit their BAM alignment files to RGD, and these files are loaded into JBrowse as is.

To open alignment tracks, go to the "Available Tracks" list, click "RNA-Seq (BAM alignments)" to open the track category, and select the experiment for which you would like to see data.  In the example to the left, the experiment was "Lazar BN-SS Hypoxia-Normoxia Study", the strain is "SS" and the condition is "Hypoxia".  The track shown is "SS Hypoxia #1 Cardiac Fibroblasts".  (To view a full size image of this example with the "Available tracks" menu also displayed, click here, or to download it click here.)

JBrowse color-codes the blocks denoting sequence alignment based on the alignment's strandedness, and whether its mate pair is present or missing:

Color Name    Hex Code    Strand aligned to    Mate pair?
Light Red     #EC8B8B     forward              present
Light Blue    #898FD8     reverse              present
Dark Red      #D11919     n/a                  absent

A black line between blocks denotes a gapped alignment, i.e. a read in which segments of the RNA sequence aligned in different places, with unaligned reference sequence between the aligned segments.  These gaps often correspond to splice junctions.  In the example to the left, most of the gapped alignments line up with the exons in the genes shown in the track above.

Click here to view this RNA-Seq BAM alignment example in JBrowse.

RNA-Seq BAM Alignment Context-Specific Menu

In addition to the standard options which appear in the context-specific menus for all tracks, for BAM alignment tracks these menus contain additional options for configuring the display:

  • Hide PCR/Optical Duplicate Reads:  When checked, duplicate reads mapped to the same location are not shown. The default is to hide these.

  • Hide reads failing vendor QC:  When checked, reads which did not pass quality control filters, such as the quality checks of the sequencing platform or vendor, are not shown.  The default is to hide these.

  • Hide reads with missing mate pairs:  When checked, if a read is missing a mate pair or paired-end match, the read is not shown.  The default is to show reads with missing mate pairs but color code the block to indicate that the mate is missing.

  • Hide secondary alignments:  When checked, secondary alignments, i.e. the additional alignments of reads which mapped to multiple locations, are not shown. The default is to hide these.

  • Hide supplementary alignments:  Where a read cannot be represented as a single linear alignment, the alignment is considered "chimeric".  In these cases, one linear alignment from the group is arbitrarily considered "representative" and the rest are considered "supplementary". When checked, these supplementary alignments are not shown.  The default is to hide the supplementary alignments.

  • Hide reads aligned to the forward strand:  When checked, none of the reads from the forward strand are displayed.  The default is to show reads aligned to the forward strand.

  • Hide reads aligned to the reverse strand:  When checked, none of the reads from the reverse strand are displayed.  The default is to show reads aligned to the reverse strand.

For more information about the SAM/BAM format and what some of these options refer to, go to the "Sequence Alignment/Map Format Specification" document (on GitHub).
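The options above correspond to bits of the FLAG field described in the SAM specification. The following is a minimal Python sketch of how such filtering could work, not JBrowse's actual code. The option names are hypothetical, the default values are taken from the list above, and the test used for a "missing mate" (a paired read whose segments were not all properly aligned) is an assumption.

```python
# SAM FLAG bits, as defined in the SAM format specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
REVERSE = 0x10
SECONDARY = 0x100
QC_FAILED = 0x200
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800

# Hypothetical option names; the default values mirror the list above.
DEFAULTS = {
    "hide_duplicates": True,
    "hide_qc_failures": True,
    "hide_missing_mates": False,
    "hide_secondary": True,
    "hide_supplementary": True,
    "hide_forward": False,
    "hide_reverse": False,
}


def is_displayed(flag, **overrides):
    """Return True if a read with this FLAG value would be drawn."""
    options = {**DEFAULTS, **overrides}
    reverse = bool(flag & REVERSE)
    # Assumption: a "missing mate" is a paired read whose segments were
    # not all properly aligned (PAIRED set, PROPER_PAIR unset).
    missing_mate = bool(flag & PAIRED) and not (flag & PROPER_PAIR)
    hide_reasons = [
        options["hide_duplicates"] and bool(flag & DUPLICATE),
        options["hide_qc_failures"] and bool(flag & QC_FAILED),
        options["hide_missing_mates"] and missing_mate,
        options["hide_secondary"] and bool(flag & SECONDARY),
        options["hide_supplementary"] and bool(flag & SUPPLEMENTARY),
        options["hide_forward"] and not reverse,
        options["hide_reverse"] and reverse,
    ]
    return not any(hide_reasons)
```

For example, `is_displayed(99)` describes a properly paired forward-strand read and is shown under the defaults, while the same read with the duplicate bit set (`99 | 0x400`) is hidden.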

Click here to view this RNA-Seq BAM alignment example in JBrowse.

RNA-Seq BAM Alignment Popup

The image to the left shows the aligned RNA sequence to which the popup below it refers.  Note that when you move the cursor over the read "glyph", the glyph is highlighted and the match ID, i.e. the "name" of the aligned segment, is displayed.

Click the segment to bring up a popup with data extracted from the source BAM alignment file.  These include:

Name:  A unique identifier for this aligned sequence

Type:  The type of alignment, in this case "match"

Score:  Mapping Quality score. This score equals -10 log10 Probability{mapping position is wrong}, i.e. minus 10 times the log base 10 of the probability that the mapping position is wrong.
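The relationship between a quality score and the underlying probability can be worked through in a few lines. This is a small Python sketch of the formula given above; the function names are illustrative only.

```python
import math


def phred_to_error_probability(score):
    """Probability that the mapping position (or base) is wrong."""
    return 10 ** (-score / 10)


def error_probability_to_phred(probability):
    """Quality score: -10 times log base 10 of the error probability."""
    return -10 * math.log10(probability)
```

Under this formula a score of 20 corresponds to a 1 in 100 chance of error and a score of 30 to a 1 in 1,000 chance, so each additional 10 points means a tenfold drop in the error probability.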

Position:  The genomic position where the sequence read hits on the genomic sequence, going from the first base that aligns to the last base that aligns, including the length of any unaligned segments that are enclosed in the alignment.  In this case, the position is "Chr5:168613773..168613961 (+)".  Based on the CIGAR string for this read (85M88N16M, described below), the alignment includes a segment from 168,613,773 to 168,613,857, indicated by a light red block, and another segment that aligns from 168,613,946 to 168,613,961 (the light red block to the right), with a segment of the reference from 168,613,858 to 168,613,945 where the RNA sequence does not align, as indicated by the black line between the blocks.  The strand of the genomic reference sequence to which the RNA sequence aligns is indicated by a "(+)" for forward, as in this case, or "(-)" for reverse.

Sequence and Quality:  The nucleotide present at each position of the alignment displayed above the quality score for that single nucleotide alignment.  The base quality score is -10 log10 Probability{base is wrong}, i.e. minus 10 times the log base 10 of the probability that the base given is wrong.  These same quality scores and nucleotide sequence are also listed separately toward the bottom of the popup.

AS:  Alignment score generated by the aligner.  A "0" value means the information is not available.

CIGAR:  A "string" of letters and numbers describing, in order from left to right from the alignment's start position to its stop position, how many nucleotides on the reference sequence were Matched (M) or Not matched (N, indicating a region of the reference that was skipped) by nucleotides in that "read" of the sequenced RNA.  In the example here, the CIGAR "string" is 85M88N16M, i.e. beginning at Chr5, base pair position 168,613,773, the alignment consists of 85 nucleotides on the reference sequence which Match, 88 nucleotides on the reference sequence that are Not matched and 16 nucleotides on the reference sequence which Match, ending at base pair position 168,613,961.  Other operation codes exist; see the SAM file format specification for the complete list.
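To make the arithmetic concrete, the sketch below walks a CIGAR string and reports the aligned blocks and the length covered on the reference. It is a simplified Python illustration, not the code JBrowse uses. It assumes 1-based inclusive coordinates, and it treats every M, = or X operation as its own block, which is a simplification.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance the position on the reference sequence.
REF_CONSUMING = set("MDN=X")


def cigar_blocks(start, cigar):
    """Return (aligned_blocks, length_on_ref) for a 1-based start position."""
    blocks = []
    pos = start
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op in "M=X":
            # Aligned block, recorded with inclusive coordinates.
            blocks.append((pos, pos + length - 1))
        if op in REF_CONSUMING:
            pos += length
    return blocks, pos - start


blocks, length_on_ref = cigar_blocks(168613773, "85M88N16M")
print(blocks)
print(length_on_ref)
```

For the example read, the total length on the reference is 85 + 88 + 16 = 189 nucleotides, which places the last aligned base at 168,613,961, the end of the Position shown in the popup.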

MD:  A string for the mismatched positions in the aligned sequence.  If all positions match this will be the number of aligned nucleotides (i.e. the length of the aligned sequence).

MQ:  Mapping Quality score. It equals -10 log10 Probability{mapping position is wrong}, i.e. minus 10 times the log base 10 of the probability that the mapping position is wrong.

NH:  Number of reported alignments that contain the query in the current record.  "1" indicates that the RNA sequence only aligned once to the reference.

NM:  Edit distance to the reference, including ambiguous bases but excluding clipping

XG, XM, XN, XO, XS, YT:  These are "user defined" tags.  In general, a "0" value means the information is not available.

Duplicate:  Indicates whether or not this read alignment is a duplicate of one or more others.

Length on ref:  The length of the reference genome sequence covered by the alignment, including the matched sequences and the segment(s) on the reference that were skipped.

Multi segment all aligned:  In the SAM specification, a "segment" is one read from a sequenced fragment (the "template"), e.g. one of the two mates of a paired-end read, not one block of a gapped alignment.  If the template produced multiple reads, were all of those reads properly aligned to the reference according to the aligner?

Multi segment first:  If the template produced multiple reads, is this the first of those reads?

Multi segment last:  If the template produced multiple reads, is this the last of those reads?

Multi segment next segment reversed:  If the template produced multiple reads, is the next read (the mate) reverse complemented, i.e. aligned to the reverse strand?

Multi segment next segment unmapped:  If the template produced multiple reads, was the next read (the mate) not able to be aligned to the reference sequence?

Multi segment template:  Did the template that this read came from produce more than one read in sequencing, as in paired-end data?

Next segment position:  If the template produced multiple reads, what is the genomic position at which the alignment of the next read (the mate) starts?

Qc failed:  Did this alignment pass or fail quality control?

Qual:  The single nucleotide quality score for each base in the alignment.  These are the same scores listed above with the sequence.

Secondary alignment:  Is this a secondary alignment?

Seq:  The RNA sequence that was aligned.  This is the same sequence listed above with the corresponding single nucleotide quality scores.

Seq length:  The length in nucleotides of the aligned sequence.

Seq reverse complemented:  Is the sequence listed above reverse complemented relative to the reference sequence?

Source:  The name of the BAM file from which the data was derived.

Supplementary alignment:  Is this a supplementary alignment?  (See the information above regarding the context-specific menu parameter for "Hide supplementary alignments".)

Unmapped:  Is this alignment not mapped to the reference assembly?
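Many of the yes/no fields listed above are stored in the BAM file as individual bits of a single integer, the FLAG field. The Python sketch below decodes a FLAG value into those fields. The bit values come from the SAM specification; pairing each bit with the popup label shown here is an assumption based on the descriptions in that specification.

```python
# (bit, popup label) pairs; bit values are from the SAM specification.
FLAG_FIELDS = [
    (0x1, "Multi segment template"),
    (0x2, "Multi segment all aligned"),
    (0x4, "Unmapped"),
    (0x8, "Multi segment next segment unmapped"),
    (0x10, "Seq reverse complemented"),
    (0x20, "Multi segment next segment reversed"),
    (0x40, "Multi segment first"),
    (0x80, "Multi segment last"),
    (0x100, "Secondary alignment"),
    (0x200, "Qc failed"),
    (0x400, "Duplicate"),
    (0x800, "Supplementary alignment"),
]


def decode_flag(flag):
    """Map each popup label to True or False for the given FLAG value."""
    return {label: bool(flag & bit) for bit, label in FLAG_FIELDS}


fields = decode_flag(99)
```

A FLAG of 99 (1 + 2 + 32 + 64) decodes to a read from a multi-read template in which all reads were properly aligned, the next read is reverse complemented, and this read is the first of the template.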

Click here to view this RNA-Seq BAM alignment example in JBrowse.

Changing the Track Height

RNA sequencing produces a huge amount of data, not only in terms of the file size for an entire genome's worth of sequence, but in terms of the number of reads that map to a single area of the genome.  This is especially true for highly expressed genes.  Because of this, the default height of a JBrowse track is often too small to be able to show all of the data for an area.  When this happens, the tool displays as much data as can be fit into the maximum track height and then displays a notice that says "Max height reached" above a black line marking the bottom of the track.

It is possible to display more data by changing the setting for the maximum track height using the "Edit config" option in the context-specific menu.  For most users, we do not recommend changing the configuration settings for JBrowse tracks since the results can sometimes be unexpected for the inexperienced user.  However, in this case, the "fix" is simple.  Click the down arrow in the track label to open the context-specific menu.  Click "Edit config" to open the configuration for that track, which is displayed as JSON text.  (This configuration only applies to the track that corresponds to the menu from which it was accessed; editing it does not change the configuration of all tracks.  To change the maximum track height for multiple tracks you must edit the configuration for each track separately.)

The 4th line of text in this file says:

"maxHeight": 600,

This tells you that the maximum height for this track is currently set to 600 px.  Change the number to your desired track height.  For example, changing "600" to "1200" will double the height.  Make sure you DO NOT remove the comma after the number; it is required for the display to work correctly.  Click the "Apply" button at the bottom of the popup window.  The track height will automatically be readjusted to the new size you have entered.
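The "maxHeight" line uses JSON-style syntax, in which a comma separates one setting from the next. The Python sketch below illustrates why the comma matters, using a made-up three-line configuration. Only "maxHeight" comes from the text above; the other keys and values are hypothetical placeholders, and a real track configuration will contain different settings.

```python
import json

# Hypothetical track configuration; only "maxHeight" comes from the text above.
config_text = """{
  "label": "example_bam_track",
  "maxHeight": 600,
  "type": "example_track_type"
}"""

# Doubling the value while keeping the comma leaves the text valid.
config = json.loads(config_text)
config["maxHeight"] *= 2

# Dropping the comma after the number makes the text invalid JSON.
broken_text = config_text.replace('"maxHeight": 600,', '"maxHeight": 1200')
try:
    json.loads(broken_text)
    comma_required = False
except json.JSONDecodeError:
    comma_required = True
```

Here `config["maxHeight"]` ends up as 1200, while the version without the comma cannot be parsed at all.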


RNA-Seq Histogram Display

Because of the volume of data inherent in RNA sequencing results, showing individual sequence alignments rapidly becomes impractical: in areas of high, or even moderate, expression it is not uncommon for hundreds or even thousands of sequences to align at a single location.  When this happens it is difficult, if not impossible, to see differences in the number of RNA sequencing reads that align.  For this reason, as the display is zoomed out, JBrowse automatically changes the display to histograms showing the number of reads aligning to each section of the display.  In other words, the displayed region is split into segments, the number of alignments in each segment is counted, and the count is converted to a log2 scale for display (an offset of 1 is added so that the displayed value is never 0 when there are counts).

For BAM alignment tracks, the transition between the normal display and the histogram occurs when the browser window is displaying approximately 20 kb of the chromosome: if your browser window is showing a smaller region than this you will see the normal display, and if it is showing a larger region the display changes to histograms.  Because of differences in the amount of data per segment, 20 kb is smaller than the transition point at which other tracks, such as those for genes or QTLs, convert to the histogram display.
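The binning described above can be sketched in a few lines of Python. This is an illustration of the idea, not JBrowse's own code. It assumes that each read is counted in the segment containing its start position and that the offset is applied as log2(count) + 1; the function and parameter names are invented for the example.

```python
import math


def histogram_heights(read_starts, region_start, region_end, num_bins):
    """Count reads per segment and convert each count to log2(count) + 1."""
    bin_width = (region_end - region_start + 1) / num_bins
    counts = [0] * num_bins
    for pos in read_starts:
        if region_start <= pos <= region_end:
            # Clamp so the last position falls in the last segment.
            index = min(int((pos - region_start) / bin_width), num_bins - 1)
            counts[index] += 1
    # Segments with no reads stay at 0; a single read gives a height of 1.
    return [math.log2(count) + 1 if count > 0 else 0.0 for count in counts]


heights = histogram_heights([1, 2, 3, 4, 150], 1, 200, 2)
```

In this small example the first segment holds four reads and the second holds one, so the heights are 3.0 and 1.0.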

The example to the left shows the histogram that is displayed when the area around the gene Ucp2 is zoomed out.  Notice that the scale on the right side of the image goes up to 16.  Since this is a logarithmic scale, this does not mean that where the peaks go up to 10 there are only 10 reads which align in that segment.  To determine the approximate number of reads aligned in that segment use the equation

numReadsAligned = 2^(histogramHeight - 1)

A value of 10 means that 2^9, or 512, reads aligned in that segment.  The maximum value of the scale, 16, corresponds to 2^15, or approximately 33,000, reads aligning to that segment.
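The same conversion can be written as a short Python function. This is only a restatement of the equation above, with an illustrative function name.

```python
def reads_from_height(histogram_height):
    """Approximate number of reads for a value read off the histogram scale."""
    return 2 ** (histogram_height - 1)
```

For a histogram value of 10, `reads_from_height(10)` returns 512. Each step up the scale doubles the estimated number of reads.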

Since the histogram display does not show individual data elements, but rather shows calculated values derived from groups of elements, there are no popups for the histogram display.

Click here to view this RNA-Seq histogram example in JBrowse.

RNA-Seq Gene Prediction Tracks

RGD accepts data submitted by researchers and can display such data in JBrowse once the researchers "okay" it to be publicly available.  One such set of tracks is the RNA-Seq-based gene prediction data submitted by the Liu lab at the Medical College of Wisconsin.  These researchers used RNA-Seq data from brain, bone marrow and kidney, with and without the inclusion of EST data, to do genome annotation of the rat v5.0 assembly, including prediction of known and novel transcripts/isoforms and prediction of novel genes.  These gene predictions are shown in the Gene Models-->RNA-Seq Predicted Gene Models-->Cancer Center, Medical College of Wisconsin tracks.  As shown in the example to the left, where the RNA-Seq and/or EST data supported prediction of a gene with its intron/exon structure, this structure is displayed.  If a gene that is displayed in the RGD Genes track is not displayed in an RNA-Seq gene prediction track, either that gene is not expressed in the corresponding tissue, or the RNA sequencing in that region was not of sufficient quality or quantity to support a gene prediction.  Please note that the absence of a gene prediction at a specific location in a specific tissue cannot be taken as proof that the corresponding gene is not expressed in that tissue.

Click here to view the RNA-Seq Gene prediction track example in JBrowse.

More information about the data in these tracks can be found in the paper "Improved rat genome gene prediction by integration of ESTs with RNA-Seq information", Li et al., Bioinformatics. 2015 Jan 1;31(1):25-32 (PMID: 25217576, http://www.ncbi.nlm.nih.gov/pubmed/?term=25217576).  For convenience, a link to the PubMed record for this article is available in the "About this track" information accessible from the track title's drop-down menu as shown in the image to the left.  For questions about this data or how it was derived, please contact the authors of the paper.

Popups for RNA-Seq gene predictions contain the information submitted by the researchers.  These include the source, which gives the tissue and method from which the data was derived; the genomic location (with a link which opens JBrowse at that specific position); the length of the predicted gene/transcript; the "name" or designation of the predicted gene/transcript as submitted by the researcher; and the "Load ID" and "Primary ID", arbitrary identifiers assigned by the computer program which originally loaded the data into the genome browser.  Note that the name of the predicted gene/transcript is a unique identifier assigned by the researchers to that transcript from that tissue and does not indicate what known gene, if any, that predicted gene could be.  For example, as shown in the examples to the left, a transcript in brain is located at the same genomic position and has the same basic intron/exon structure as the known gene C2cd3, but the name of that predicted transcript is g2062, not C2cd3.