Visualization of genomic data using Gviz package

Gviz package introduction

Introduction to Gviz software package

The Gviz package aims to provide a structured visualization framework to draw any type of data along genomic coordinates. It also allows the integration of public genome annotation data from sources such as UCSC or ENSEMBL.

Like most genome browsers, individual types of genome features or data are represented by separate tracks.

By default, Gviz checks the validity of all provided chromosome names on UCSC (chromosomes must start with chr string).
You can decide to turn off this function by calling options(ucscChromosomeNames=FALSE)

In the following example, the UCSC Genome and chromosome 7 (chr7) on the mouse mm9 genome will be utilized

Drawing an AnnotationTrack diagram

Note that the AnnotationTrack constructor can accommodate many different types of input.

For example, the start and end coordinates of the annotation function can be used as a single parameter
start and end as data Frame or even as IRanges or GRangesList objects.

library(Gviz)
library(GenomicRanges)
#Load data: class = grades
data(cpgIslands)
cpgIslands

## GRanges object with 10 ranges and 0 metadata columns:
##        seqnames            ranges strand
##           <Rle>         <IRanges>  <Rle>
##    [1]     chr7 26549019-26550183      *
##    [2]     chr7 26564119-26564500      *
##    [3]     chr7 26585667-26586158      *
##    [4]     chr7 26591772-26593309      *
##    [5]     chr7 26594192-26594570      *
##    [6]     chr7 26623835-26624150      *
##    [7]     chr7 26659284-26660352      *
##    [8]     chr7 26721294-26721717      *
##    [9]     chr7 26821518-26823297      *
##   [10]     chr7 26991322-26991841      *
##   -------
##   seqinfo: 1 sequence from hg19 genome; no seqlengths

#Annotation track, titled "CpG"
atrack <- AnnotationTrack(cpgIslands, name = "CpG")
p<-plotTracks(atrack)

Add genome coordinates

## Genome coordinates
gtrack <- GenomeAxisTrack()
plotTracks(list(gtrack, atrack))

Add ideograph

To add a chromosome ideogram, we must specify a valid UCSC Genome (e.g. "hg19") and a chromosome name (e.g. "chr7")

Because this function obtains data from UCSC, an Internet connection is required, which may take a long time.

# genome : "hg19" 
gen<-genome(cpgIslands)
# Chromosme name : "chr7"
chr <- as.character(unique(seqnames(cpgIslands)))
# Ideogram track
itrack <- IdeogramTrack(genome = gen, chromosome = chr)
plotTracks(list(itrack, gtrack, atrack))

Ideogram tracks are an exception to all Gviz track objects because they are not really displayed in the same coordinate system as all other tracks.
Instead, the current genome position is represented by a red box on the chromosome (or, in this case, a red line if the width is too small to accommodate the box).

Add gene model

We can use existing gene model information from local sources. Alternatively, you can download such data from one of the available online resources, such as UCSC or ENSEBML, and have constructors to handle these tasks.

For this example, we will start from the stored data Frame loads gene model data.

# Load data
data(geneModels)
head(geneModels)

# Drawing
grtrack <- GeneRegionTrack(geneModels, genome = gen,
                           chromosome = chr, name = "Gene Model")
plotTracks(list(itrack, gtrack, atrack, grtrack))

Enlarge drawing

We often want to zoom in or out of a specific drawing area to see more details or a broader overview.

# Scale using the from and to parameters
plotTracks(list(itrack, gtrack, atrack, grtrack),
           from = 26700000, to = 26750000)

# Use extend Left and extend Right to zoom
# These parameters are relative to the current display range,
# And can be used to quickly expand the view at one or both ends of the drawing.
plotTracks(list(itrack, gtrack, atrack, grtrack),
           extend.left = 0.5, extend.right = 500000)

# Delete the boundary of exon to better show the picture
plotTracks(list(itrack, gtrack, atrack, grtrack),
           extend.left = 0.5, extend.right = 500000, col = NULL)

A value of 0.5 zooms in to half the current display range

Add a sequence track and zoom in to see the sequence

The necessary sequence information comes from the BSgenome package

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")

library(BSgenome.Hsapiens.UCSC.hg19)
strack <- SequenceTrack(Hsapiens, chromosome = chr)
plotTracks(list(itrack, gtrack, atrack, grtrack,
                strack), from = 26591822, to = 26591852, cex = 0.8)

Add data track

DataTrack objects are essentially run length coded (Rle) digital vectors or matrices that we can use to add various digital data to our genome coordinate map.

Different visualization options for these track s are available, including point diagram, histogram and box whisker diagram representation.

# For demonstration purposes, we can create a simple DataTrack object from randomly sampled data
set.seed(255)
lim <- c(26700000, 26750000)
coords <- sort(c(lim[1], sample(seq(from = lim[1], 
                                    to = lim[2]), 99), lim[2]))
dat <- runif(100, min = -10, max = 10)
head(dat)

##data track
dtrack <- DataTrack(data = dat, start = coords[-length(coords)],
                    end = coords[-1], chromosome = chr, genome = gen,
                    name = "Uniform")

# Draw data track diagram
plotTracks(list(itrack, gtrack, atrack, grtrack,
                dtrack), from = lim[1], to = lim[2])

# Change the plot type to histogram
plotTracks(list(itrack, gtrack, atrack, grtrack,dtrack),
           from = lim[1], to = lim[2], type = "histogram")

This visualization is particularly useful when displaying, for example, the coverage of NGS reads along chromosomes, or displaying the measurements of mapping probes from microarray experiments.

Drawing parameters

Set parameters

#Annotation of transcript
# Change panel and title background color
grtrack <- GeneRegionTrack(geneModels, genome = gen,
                           chromosome = chr, name = "Gene Model", 
                           transcriptAnnotation = "symbol",
                           background.panel = "#dbeeff",
                           background.title = "darkblue")
plotTracks(list(itrack, gtrack, atrack, grtrack))

Drawing direction

By default, all tracks are drawn in the 5 '- > 3' direction. Sometimes it is useful to actually display data relative to the opposite chain.

plotTracks(list(itrack, gtrack, atrack, grtrack),
           reverseStrand = TRUE)

The data has been plotted on the reverse chain, which is also reflected in the GenomeAxis track.

Track classes

Several parameters can be used to change the appearance of different tracks.

Display parameters of GenomeAxisTrack

Set the position of the label to the bottom, display the ID, and change the color

axisTrack <- GenomeAxisTrack(range = IRanges(start = c(2000000,4000000), end = c(3000000, 7000000), names = rep("N-stretch", 2)))
plotTracks(axisTrack, from = 1000000, to = 9000000, 
           labelPos = "below",showId=TRUE, col="#9400D3")

IdeogramTrack

#Ideogram
ideoTrack <- IdeogramTrack(genome = "hg19", chromosome = "chrX")
plotTracks(ideoTrack, from = 85000000, to = 129000000)

#Display chromosome band ID
plotTracks(ideoTrack, from = 85000000, to = 129000000,
           showId = FALSE, showBandId = TRUE, cex.bands = 0.4)

DataTrack

In essence, they constitute a run length encoded digital vector or matrix associated with a specific genomic coordinate range.

There can be multiple samples in a single dataset. The drawing method provides a tool for merging sample group information.

Therefore, the starting point for creating a DataTrack object will always be a set of ranges, in the form of IRanges or GRanges objects, or separately as start and end coordinates or widths. The second component is a numerical vector with the same length as the number of ranges, or a numerical matrix with the same number of columns.

Next, we will load our sample data from the ranges object that is part of the Gviz package

# Load data
data(twoGroups)
head(twoGroups)

# data track diagram (the default visualization of DataTrack is point diagram)
 dTrack <- DataTrack(twoGroups, name = "uniform")
 plotTracks(dTrack)

Different picture types

Possible drawing types are:

# Point diagram
plotTracks(DataTrack(twoGroups, name = "p"), type="p")
# Line diagram
plotTracks(DataTrack(twoGroups, name = "l"), type="l")
# Line and point diagrams
plotTracks(DataTrack(twoGroups, name = "b"), type="b")
# Average chart
plotTracks(DataTrack(twoGroups, name = "a"), type="a")
# Straight line diagram
plotTracks(DataTrack(twoGroups, name = "h"), type="h")
# Histogram (bar width equals range)
plotTracks(DataTrack(twoGroups, name = "histogram"), type="histogram")
# 'polygon type' diagram
plotTracks(DataTrack(twoGroups, name = "polygon"), type="polygon")
# Box whisker diagram
plotTracks(DataTrack(twoGroups, name = "boxplot"), type="boxplot")
# Color image of each value
plotTracks(DataTrack(twoGroups, name = "heatmap"), type="heatmap")

DataTrack diagram example:

# Combine boxplot with mean line and data grid (g):
plotTracks(dTrack, type = c("boxplot", "a", "g"))
# Heat map and display sample name
plotTracks(dTrack, type = c("heatmap"), showSampleNames = TRUE,
           cex.sampleNames = 0.6)

Data grouping

You can use factor variables to group individual samples together.

plotTracks(dTrack, groups = rep(c("control", "treated"), each = 3),
           type = c("a", "p"), legend=TRUE)
# Box diagram
plotTracks(dTrack, groups = rep(c("control", "treated"), each = 3),
           type = "boxplot")

# Aggregation group, which can be average, median, extreme, sum, minimum and maximum
plotTracks(dTrack, groups = rep(c("control", "treated"),each = 3),
           type = c("b"), aggregateGroups = TRUE,
           aggregation = "max")

Building DataTrack objects from files

The DataTrack class supports the most common file types, such as wig, bigWig, bedGraph, and bam files

bgFile <- system.file("extdata/test.bedGraph", package = "Gviz")
dTrack2 <- DataTrack(range = bgFile, genome = "hg19",
                     type = "l", chromosome = "chr19", name = "bedGraph")
plotTracks(dTrack2)

Note that the Gviz package uses the functionality in the rtracklayer package for most file import operations

Files in the Gviz package support streaming from index files.
Only relevant parts of the data need to be loaded during the drawing operation, so the underlying data file may be very large without reducing performance or causing too much memory consumption.

We will use the small bam file provided with the package here to illustrate this function.

bam files contain alignment of sequences with common references. The most natural representation of such data in DataTrack is to view only the alignment coverage of a given location and encode it in a single elementMetadata column.

bamFile <- system.file("extdata/test.bam", package = "Gviz")
dTrack4 <- DataTrack(range = bamFile, genome = "hg19",
                     type = "l", name = "Coverage", window = -1, chromosome = "chr1")
plotTracks(dTrack4, from = 189990000, to = 190000000)

AnnotationTrack

Essentially, they consist of one or more genomic ranges that can be grouped into composite annotation elements if necessary.

The necessary building blocks are range coordinates, chromosomes, and genome identifiers.

Information can be passed to functions in the form of separate function parameters, such as IRanges, GRanges, or data Frame object.

aTrack <- AnnotationTrack(start = c(10, 40, 120),
                          width = 15, chromosome = "chrX", strand = c("+","*", "-"),
                          id = c("Huey", "Dewey", "Louie"),
                          genome = "hg19", name = "foo")
plotTracks(aTrack)

Building AnnotationTrack objects from files

The default import function reads the coordinates of all sequence alignments from the bam file.

#Annotation track
aTrack2 <- AnnotationTrack(range = bamFile, genome = "hg19",
                           name = "Reads", chromosome = "chr1")
plotTracks(aTrack2, from = 189995000, to = 190000000)

We can now draw DataTrack and AnnotationTrack representations of bam files at the same time to prove that the basic data is indeed the same.

plotTracks(list(dTrack4, aTrack2), from = 189990000,
           to = 190000000)

GeneRegionTrack

GeneRegionTrack is very similar to AnnotationTrack in principle.

The only difference is that they are more gene / transcript centric. We need to pass the start and end position (or width) of each annotation feature in the track, and provide exon, transcript and gene identifiers for each item, which will be used to create transcript groups.

A somewhat special case is to build GeneRegionTrack directly from one of the TranscriptDb. This option is described in more detail below.

data(geneModels)
grtrack <- GeneRegionTrack(geneModels, genome = gen,
                           chromosome = chr, name = "foo", 
                           transcriptAnnotation = "symbol")

Building a GeneRegionTrack object from fTranscriptDbsx

The GenomicFeatures package provides a framework for downloading gene model information from online resources and storing it locally in the SQLite database.

One advantage of building GeneRegionTracks from TranscriptDb is that we get additional information about the coding and non coding regions of the transcript, that is, the coordinates of the 5 'and 3' UTR and CDS regions.

library(GenomicFeatures)
samplefile <- system.file("extdata", "UCSC_knownGene_sample.sqlite",
    package = "GenomicFeatures")
txdb <- loadDb(samplefile)
txTr <- GeneRegionTrack(txdb, chromosome = "chr6", start = 300000, end = 350000)
#feature(txTr)
plotTracks(txTr)

BiomartGeneRegionTrack

BiomartGeneRegionTrack, which provides a direct interface to ENSEMBL Biomart service.

We only need to input the genome, chromosome and the start and end positions on the chromosome. The constructor BiomartGeneRegionTrack will automatically contact ENSEMBL to obtain the necessary information and dynamically construct the gene model.

An Internet connection is required to work, and depending on usage and network traffic, contacting Biomart may take a lot of time.

biomTrack <- BiomartGeneRegionTrack(genome = "hg19",
                                    chromosome = chr, start = 20000000, end = 21000000,
                                    name = "ENSEMBL")
plotTracks(biomTrack, col.line = NULL, col = NULL)

Sequence Track

library(BSgenome.Hsapiens.UCSC.hg19)
sTrack <- SequenceTrack(Hsapiens)
#sequence track : add 5'->3'
plotTracks(sTrack, chromosome = 1, from = 20000,to = 20050,
           add53=TRUE)
#The complement
plotTracks(sTrack, chromosome = 1, from = 20000,to = 20050,
           add53=TRUE, complement = TRUE)

AlignmentsTrack

The graph of aligned sequences, usually from next-generation sequencing experiments, may be very useful, such as when visually checking the effectiveness of what is called SNP. These comparisons are usually stored in BAM files.

RNA-seq

In this demonstration, let's use a small BAM file in which paired NGS reads have been mapped to extracts of the human hg19 genome.

The data are from RNA SEQ experiment. The comparison is carried out using STAR, and gaps are allowed.

Some gene annotation data of this region were also downloaded from Biomart.

afrom=2960000
ato=3160000
#bam file
alTrack <- AlignmentsTrack(system.file(package = "Gviz", "extdata", "gapped.bam"), isPaired = TRUE)
bmt <- BiomartGeneRegionTrack(genome = "hg19", chromosome = "chr12",
                              start = afrom, end = ato, filter = list(with_ox_refseq_mrna = TRUE),
                              stacking = "dense")
plotTracks(c(bmt, alTrack), from = afrom, to = ato, chromosome = "chr12")

The figure above shows the overall layout of the track

At the top, we have a panel that displays the read coverage information in the form of a histogram. Below this panel, there is a stacked view of individual reads.
Only a certain space is available for drawing, and the whole stacking depth cannot be displayed here. The rest is indicated by the white down arrow in the title Panel.
We can solve this problem by using the max.height, min.height or stackHeight display parameters, which control the height or vertical spacing of stacked reads. Alternatively, we can reduce the size of the coverage area by setting the coverageHeight or minCoverageHeight parameters.

plotTracks(c(bmt, alTrack), from = afrom, to = ato,
           chromosome = "chr12", min.height = 0, coverageHeight = 0.08,
           minCoverageHeight = 0)

So far, pile ups are not particularly useful. We can turn them off by setting the type display parameters accordingly.

plotTracks(c(alTrack, bmt), from = afrom, to = ato, chromosome = "chr12", type = "coverage")

Zoom in further to see the details of the stacked part:

plotTracks(c(bmt, alTrack), from = afrom + 12700,
           to = afrom + 15200, chromosome = "chr12")

The direction of a single read is indicated by an arrow
read pairs are connected by bright gray lines
The gaps are displayed by a connected dark gray line
On devices that support transparency, we can also see that some reads actually overlap.

As mentioned earlier, we can control whether data should be treated as double ended data or single ended data by setting the isPaired parameter in the constructor.

Here is how we view the data in the same file in single ended mode

alTrack <- AlignmentsTrack(system.file(package = "Gviz",
                                       "extdata", "gapped.bam"), isPaired = FALSE)
plotTracks(c(bmt, alTrack), from = afrom + 12700,
           to = afrom + 15200, chromosome = "chr12")

DNA-seq

In order to better display the sequence variation characteristics of alignmenttrack, we will load different data sets, this time from genome-wide DNA SEQ SNP calling.

Similarly, the reference genome was hg19 and was compared using Bowtie2.

We need to specify alignmenttrack about the reference genome (sequenceTrack).

afrom <- 44945200
ato <- 44947200
alTrack <- AlignmentsTrack(system.file(package = "Gviz","extdata", "snps.bam"), isPaired = TRUE)
plotTracks(c(alTrack, sTrack), chromosome = "chr21", from = afrom,to = ato)

The mismatched bases are displayed in the form of superimposed histogram on the stacking part and a single read in the overlay.

When magnified to one of the obvious heterozygous SNP positions, more details can be revealed.

# enlarge
plotTracks(c(alTrack, sTrack), chromosome = "chr21", from = 44946590, to = 44946660)

# Show single letter
plotTracks(c(alTrack, sTrack), chromosome = "chr21",
           from = 44946590, to = 44946660, cex = 0.5, min.height = 8)

Track highlight and overlay

highlight

#highlight
ht <- HighlightTrack(trackList = list(atrack, grtrack,dtrack),
                     start = c(26705000, 26720000), width = 7000, chromosome = 7)
plotTracks(list(itrack, gtrack, ht), from = lim[1], to = lim[2])

superposition

For some applications, it makes sense to overlay multiple track s on the same area of the drawing.

For example, we will generate a second DataTrack object and combine it with the existing objects in Chapter 2.

#create data
dat <- runif(100, min = -2, max = 22)
dtrack2 <- DataTrack(data = dat, start = coords[-length(coords)],
                     end = coords[-1], chromosome = chr, genome = gen,
                     name = "Uniform2", groups = factor("sample 2",levels = c("sample 1", "sample 2")),
                     legend = TRUE)
displayPars(dtrack) <- list(groups = factor("sample 1",levels = c("sample 1", "sample 2")), legend = TRUE)
ot <- OverlayTrack(trackList = list(dtrack2, dtrack))
ylims <- extendrange(range(c(values(dtrack), values(dtrack2))))
plotTracks(list(itrack, gtrack, ot), from = lim[1], to = lim[2], ylim = ylims, type = c("smooth", "p"))

On devices that support it, alpha blending is a useful tool to sort out more information from track coverage, at least when comparing a small number of samples.

The resulting transparency effectively eliminates the problem of over drawing. The following example is only valid if this inset is built on a system with transparency blending support.

displayPars(dtrack) <- list(alpha.title = 1, alpha = 0.5)
displayPars(dtrack2) <- list(alpha.title = 1, alpha = 0.5)
ot <- OverlayTrack(trackList = list(dtrack, dtrack2))
plotTracks(list(itrack, gtrack, ot), from = lim[1],
           to = lim[2], ylim = ylims, type = c("hist"), window = 30)

reference resources

Gviz http://www.bioconductor.org/packages/release/bioc/html/Gviz.html

https://github.com/ivanek/Gviz

Hahne, Florian, and Robert Ivanek. 2016. "Statistical Genomics: Methods and Protocols." In, edited by Ewy Mathé and Sean Davis, 335–51. New York, NY: Springer New York. https://doi.org/10.1007/978-1-4939-3578-9_16.

Katz, Yarden, Eric T. Wang, Jacob Silterra, Schraga Schwartz, Bang Wong, Helga Thorvaldsdóttir, James T. Robinson, Jill P. Mesirov, Edoardo M. Airoldi, and Christopher B. Burge. 2015. "Quantitative Visualization of Alternative Exon Expression from Rna-Seq Data." Bioinformatics. https://doi.org/10.1093/bioinformatics/btv034.

Keywords: R Language

Added by aleigh on Fri, 07 Jan 2022 10:39:51 +0200

Programming VIP