Workshop issue 15: heat map

Click the link to view the video of station b.

https://www.bilibili.com/video/BV1Pa4y1E7WS

reference material

Document of pheatmap package
pheatmap introduction: https://www.jianshu.com/p/42c643a43a1f#fn3
Introduction to correlation coefficient matrix package: https://www.jianshu.com/p/b76f09aacd9c
Color block control: https://www.jianshu.com/p/6b765e83d723
Color matching: https://www.jianshu.com/p/e50babec45cb

R package tools

There are many packages that can draw heat maps. Here are some examples

pheatmap::pheatmap (common heat map)
stats::heatmap (basic heat map)
gplots::heatmap.2 (basic enhanced heat map)
ComplexHeatmap (complex heat map for genome analysis)
ggplot2::ggplot (ggplot visualization)
heatmaply::heatmaply (Interactive)
......

Why do heat maps?

Heat map can intuitively show the relationship or difference between multiple variables through color depth
The heat map can show multivariable clustering results

Basic elements

Data preprocessing
Main parameters of heat map
Note content (significance; legend)

solve the problem

Color block control (hot map, red and blue block problem, number size and positive and negative value problem)
Show partial heat map
Marked correlation coefficient Heatmap significance
The numerical difference is too large, and the distinction between heat maps is not obvious
Color matching

pheatmap package

1. Brief introduction

The pheatmap function in the pheatmap package can perform k-means clustering and hierarchical clustering on the input data, but the clustering cannot be performed when the number of rows of data exceeds 1000.

2. Function parameters

pheatmap(mat, color = colorRampPalette(rev(brewer.pal(n = 7, name =
  "RdYlBu")))(100), kmeans_k = NA, breaks = NA, border_color = "grey60",
  cellwidth = NA, cellheight = NA, scale = "none", cluster_rows = TRUE,
  cluster_cols = TRUE, clustering_distance_rows = "euclidean",
  clustering_distance_cols = "euclidean", clustering_method = "complete",
  clustering_callback = identity2, cutree_rows = NA, cutree_cols = NA,
  treeheight_row = ifelse((class(cluster_rows) == "hclust") || cluster_rows,
  50, 0), treeheight_col = ifelse((class(cluster_cols) == "hclust") ||
  cluster_cols, 50, 0), legend = TRUE, legend_breaks = NA,
  legend_labels = NA, annotation_row = NA, annotation_col = NA,
  annotation = NA, annotation_colors = NA, annotation_legend = TRUE,
  annotation_names_row = TRUE, annotation_names_col = TRUE,
  drop_levels = TRUE, show_rownames = T, show_colnames = T, main = NA,
  fontsize = 10, fontsize_row = fontsize, fontsize_col = fontsize,
  angle_col = c("270", "0", "45", "90", "315"), display_numbers = F,
  number_format = "%.2f", number_color = "grey30", fontsize_number = 0.8
  * fontsize, gaps_row = NULL, gaps_col = NULL, labels_row = NULL,
  labels_col = NULL, filename = NA, width = NA, height = NA,
  silent = FALSE, na_col = "#DDDDDD", ...)

3. Data processing (create a matrix here first)

> library(pheatmap)
> # Create test matrix
> test = matrix(rnorm(200), 20, 10)
> test[1:10, seq(1, 10, 2)] = test[1:10, seq(1, 10, 2)] + 3
> test[11:20, seq(2, 10, 2)] = test[11:20, seq(2, 10, 2)] + 2
> test[15:20, seq(2, 10, 2)] = test[15:20, seq(2, 10, 2)] + 4
> colnames(test) = paste("Test", 1:10, sep = "")
> rownames(test) = paste("Gene", 1:20, sep = "")
> head(test, 5)
          Test1       Test2    Test3      Test4    Test5      Test6    Test7      Test8
Gene1 2.9506296  0.04980323 3.385544  1.1669879 1.885813 -0.9771256 3.112077 -0.6027554
Gene2 2.0099044  0.51020937 4.011563  0.2758383 5.311161 -0.5104894 1.078437  1.2681561
Gene3 3.8667236 -1.00110670 1.763909 -0.4958210 2.814205  0.9438923 2.238253  2.6206316
Gene4 2.1241335 -0.17389532 3.381439  0.9546686 5.046722 -1.5613329 2.322061 -1.4255754
Gene5 0.2062854  0.45703810 3.471799 -1.2390113 2.333674 -1.0852306 3.996635 -0.8064933
         Test9       Test10
Gene1 3.365981 -1.257651534
Gene2 0.727328 -0.004095374
Gene3 5.569084  0.020165678
Gene4 2.413953 -0.644605108
Gene5 2.838279 -0.004935564

4. Draw heat map

Default parameters

> pheatmap(test)

The parameters are not set. Clustering is the default. By default, it clusters both rows and columns of matrix data. You can set it separately to cluster only rows or columns.

> pheatmap(test, cluster_row = FALSE)

The parameter needs to be set with Boolean value, cluster_row = FALSE, that is, the rows are not clustered

K-means clustering

> pheatmap(test, kmeans_k = 2)

k means clustering can set the number of clusters by itself. After clustering, the number of variables contained in the cluster will be automatically displayed. The specific variables in the class are not directly displayed on the graph, but this information can be obtained by viewing the heat map list.

> test_k <- pheatmap(test, kmeans_k = 2)
> cluster <- test_k$kmeans$cluster
> cluster
 Gene1  Gene2  Gene3  Gene4  Gene5  Gene6  Gene7  Gene8  Gene9 Gene10 Gene11 Gene12 Gene13 
     2      2      2      2      2      2      2      2      2      2      2      1      2 
Gene14 Gene15 Gene16 Gene17 Gene18 Gene19 Gene20 
     1      1      1      1      1      1      1

The variables contained in the cluster can be further extracted

> cluster_1 <- names(cluster[which(cluster == 1)])
> cluster_1
[1] "Gene12" "Gene14" "Gene15" "Gene16" "Gene17" "Gene18" "Gene19" "Gene20"
> cluster_2 <- names(cluster[which(cluster == 2)])
> cluster_2
 [1] "Gene1"  "Gene2"  "Gene3"  "Gene4"  "Gene5"  "Gene6"  "Gene7"  "Gene8"  "Gene9"  "Gene10"
[11] "Gene11" "Gene13"

Normalization and clustering basis

Set the clustering basis of rows or columns separately. The basis here is Pearson coefficient. You can select 'correlation', 'euclidean', 'maximum', 'manhattan', 'canberra', 'binary', 'minkowski' to change the parameters

Normalization processing is also done here. It can be seen that the content of color block reference also changes with the change of data range.

> pheatmap(test, scale = "row", clustering_distance_rows = "correlation")

color setting

Take a look at this colorRampPalette. It takes 50 colors to mark the relationship and difference between data.

The larger the value of colorRampPalette is set, the more diverse the colors are, and the clearer the data gap can be reflected.

> pheatmap(test, color = colorRampPalette(c("navy", "white", "firebrick3"))(50))
> colorRampPalette(c("navy", "white", "firebrick3"))(50)
 [1] "#000080" "#0A0A85" "#14148A" "#1F1F8F" "#292994" "#343499" "#3E3E9F" "#4848A4" "#5353A9"
[10] "#5D5DAE" "#6868B3" "#7272B9" "#7C7CBE" "#8787C3" "#9191C8" "#9C9CCD" "#A6A6D2" "#B0B0D8"
[19] "#BBBBDD" "#C5C5E2" "#D0D0E7" "#DADAEC" "#E4E4F2" "#EFEFF7" "#F9F9FC" "#FDFAFA" "#FBF1F1"
[28] "#F9E8E8" "#F7E0E0" "#F5D7D7" "#F3CECE" "#F1C5C5" "#EFBCBC" "#EDB3B3" "#EBAAAA" "#E9A2A2"
[37] "#E79999" "#E59090" "#E38787" "#E17E7E" "#DF7575" "#DD6C6C" "#DB6464" "#D95B5B" "#D75252"
[46] "#D54949" "#D34040" "#D13737" "#CF2E2E" "#CD2626"

Hierarchical clustering method

> pheatmap(test,clustering_method = "ward")

clustering_method optional parameters include 'ward', 'ward D', 'ward. D2 ',' single ',' complete ',' average ',' mcquitty ',' intermediate 'or' centroid ', these are the calculation methods of the distance between classes in hierarchical clustering.

legend = FALSE

> pheatmap(test, cluster_row = FALSE, legend_breaks = -1:4, legend_labels = c("0",
                                                                            "1e-4", "1e-3", "1e-2", "1e-1", "1"))

Here, the custom setting range of legend is between - 1 and 4, and the label is set to these 6 numbers. You can set the range you want to see in a certain color range at a glance.

Color block internal information

> pheatmap(test, display_numbers = TRUE)
> pheatmap(test, display_numbers = TRUE, number_format = "%.1e")

Display matrix information. When the value is large or small, you can add parameter setting number_ Format = '%. 1E', displays content in scientific counting.

> pheatmap(test, display_numbers = matrix(ifelse(test > 5, "*", ""), nrow(test)))

Here, the original matrix is processed and a new matrix is constructed to store the marks. The marks greater than 5 are * * * * *. If it is a heat map made with correlation coefficient, the significance mark can be added in this way.

Set the heat map and font size and title content

> pheatmap(test, cellwidth = 15, cellheight = 12, fontsize = 8, main = "Example heatmap", filename = "test.pdf")

cellwidth sets the heat map length, cellheight sets the heat map height, main sets the theme, fontsize sets all fonts, and filename sets the heat map.

5. Notes to heat map

Create comment information for rows and columns

> # Generate annotations for rows and columns
> annotation_col = data.frame(
+   CellType = factor(rep(c("CT1", "CT2"), 5)), 
+   Time = 1:5
+ )
> rownames(annotation_col) = paste("Test", 1:10, sep = "")
> annotation_row = data.frame(
+   GeneClass = factor(rep(c("Path1", "Path2", "Path3"), c(10, 4, 6)))
+ )
> rownames(annotation_row) = paste("Gene", 1:20, sep = "")
> annotation_col
       CellType Time
Test1       CT1    1
Test2       CT2    2
Test3       CT1    3
Test4       CT2    4
Test5       CT1    5
Test6       CT2    1
Test7       CT1    2
Test8       CT2    3
Test9       CT1    4
Test10      CT2    5
> annotation_row
       GeneClass
Gene1      Path1
Gene2      Path1
Gene3      Path1
Gene4      Path1
Gene5      Path1
Gene6      Path1
Gene7      Path1
Gene8      Path1
Gene9      Path1
Gene10     Path1
Gene11     Path2
Gene12     Path2
Gene13     Path2
Gene14     Path2
Gene15     Path3
Gene16     Path3
Gene17     Path3
Gene18     Path3
Gene19     Path3
Gene20     Path3

Heat map for annotation information

> # Display row and color annotations
> pheatmap(test, annotation_col = annotation_col, annotation_row = annotation_row)

Additional information can be displayed through the added color block, and the rows and columns can be annotated respectively. In addition, the annotated color bar can cancel the annotation_legend = FALSE, but it is not clear what the color represents.

Change color for annotation information color block

> pheatmap(test, annotation_col = annotation_col, annotation_row = annotation_row, 
+          annotation_colors = ann_colors)

Block the heat map

> pheatmap(test, annotation_col = annotation_col, cluster_rows = FALSE, gaps_row = c(10, 14), 
+          cutree_col = 2)

gaps_row divides the row. c(10,14) divides the row into three parts with 10 and 14 as dividing lines.

cutree_col splits the column, 2 that is, it is divided into two blocks.

Change heat map row and column names

You can create a new vector to store the row and column names you want to set and pass them to label_row this parameter.

> labels_row = c("", "", "", "", "", "", "", "", "", "", "", "", "", "", "", 
+                "", "", "Il10", "Il15", "Il1b")
> pheatmap(test, annotation_col = annotation_col, labels_row = labels_row)

Heat map sorting

> callback = function(hc, mat){
+   sv = svd(t(mat))$v[,1]
+   dend = reorder(as.dendrogram(hc), wts = sv)
+   as.hclust(dend)
+ }
> pheatmap(test, clustering_callback = callback)

Solution 1: color block control

In many cases, we need to intuitively look through the color. When the data is within the range of positive and negative values, the ideal situation is to take 0 as white, and the other two color systems represent positive and negative values respectively, which is easy to read at a glance.

Take the correlation coefficient matrix between test s as an example to check the value range:

> library(Hmisc)
> cortest <- rcorr(as.matrix(test), type = "pearson")
> r_value <- cortest$r
> range(r_value)
[1] -0.8980495  1.0000000

Set color block (set 0 to white by symmetry)

> pheatmap(r_value,
+          color = c(colorRampPalette(c("blue","white"))(10),
+                    colorRampPalette(c("white","red"))(10)),
+          legend_breaks=seq(-1,1,0.2))

Problem solving 2: display part of the heat map

The data form used in the phatmap package is a matrix. When performing correlation analysis inside variables, there is half of the data redundancy, but the matrix form can only change the redundant information into 0 through test [upper.tri (test)] < - Na, which can not be removed directly. It seems that there are no parameters in the parameters that can directly draw the upper and lower triangle heat map, or the upper and lower triangle heat map can be drawn.

[if ggplot2 is used for heat map drawing, since its input data is long data, the data can be converted through melt() in reshape package for drawing]

> r_value[upper.tri(r_value)] <- 0  
> pheatmap(r_value, 
         c(colorRampPalette(c("blue","white"))(5),
          colorRampPalette(c("white","red"))(5)),
         legend_breaks=seq(-1,1,0.2),
         cluster_rows=F, cluster_cols=F, border_color=NA)

However, there is a big BUG, that is, the data itself still exists, and the removed triangular matrix part should be changed into the value corresponding to the white color block.

Solution 3: Mark correlation coefficient heat map significance

Here, a package Hmisc for calculating the correlation matrix is introduced, which can calculate the correlation matrix, and then directly use the correlation matrix to draw the heat map, and then according to the display_ The numbers parameter is used for significance labeling.

Suppose now I want to see the heat map of the correlation between these 10 test s:

> library(Hmisc)
> cortest <- rcorr(as.matrix(test), type = "pearson")
> p_value <- ifelse(is.na(cortest$P), "", cortest$P)
> pheatmap(cortest$r, display_numbers = matrix(ifelse(p_value < 0.05, "*", ""), nrow(p_value)))

spearman can be selected as the relevant type

Similarly, if you want to see the heat map of the correlation relationship between these 20 genes, you can transpose the data frame t(), because the function for calculating the correlation matrix performs correlation analysis on the variables of the column by default.

Solution 4: take logarithm

> pheatmap(log10(test))

Problem 5: color matching

> library("RColorBrewer")
> display.brewer.all()
> brewer.pal.info
         maxcolors category colorblind
BrBG            11      div       TRUE
PiYG            11      div       TRUE
PRGn            11      div       TRUE
PuOr            11      div       TRUE
RdBu            11      div       TRUE
RdGy            11      div      FALSE
RdYlBu          11      div       TRUE
RdYlGn          11      div      FALSE
Spectral        11      div      FALSE
Accent           8     qual      FALSE
Dark2            8     qual       TRUE
Paired          12     qual       TRUE
Pastel1          9     qual      FALSE
Pastel2          8     qual      FALSE
Set1             9     qual      FALSE
Set2             8     qual       TRUE
Set3            12     qual      FALSE
Blues            9      seq       TRUE
BuGn             9      seq       TRUE
BuPu             9      seq       TRUE
GnBu             9      seq       TRUE
Greens           9      seq       TRUE
Greys            9      seq       TRUE
Oranges          9      seq       TRUE
OrRd             9      seq       TRUE
PuBu             9      seq       TRUE
PuBuGn           9      seq       TRUE
PuRd             9      seq       TRUE
Purples          9      seq       TRUE
RdPu             9      seq       TRUE
Reds             9      seq       TRUE
YlGn             9      seq       TRUE
YlGnBu           9      seq       TRUE
YlOrBr           9      seq       TRUE
YlOrRd           9      seq       TRUE

View specific colors

> display.brewer.pal(n = 9, name = 'YlOrBr')

supplement

ComplexHeatmap provides a new function to directly convert pheatmap into ComplexHeatmap

source: https://jokergoo.github.io/2020/05/06/translate-from-pheatmap-to-complexheatmap/

Using corrplot package to do correlation heat map is also a good choice

Editor: Wu Pancheng

Added by jiehuang001 on Fri, 21 Jan 2022 17:08:23 +0200

Programming VIP