How to draw a scatter diagram with R

Drawing a "symmetric scatter diagram" of differential gene expression with R

In transcriptome analysis, how should the differentially expressed genes between two groups be presented? A volcano plot is probably the first thing that comes to mind, and it is indeed the most frequently used. In a volcano plot, differentially expressed genes are easy to identify from the Fold Change between the two groups and the significance p value. A volcano plot is essentially a scatter plot: the abscissa generally represents the log2-transformed Fold Change, and the ordinate represents the -log10-transformed p value or adjusted p value (left in the figure below). Another common scatter-plot style for showing differentially expressed genes puts the average expression of the genes in each of the two groups on the horizontal and vertical axes. This style makes it more convenient and intuitive to compare the differential status of genes between the two groups.

1 sample file

The example file "gene_diff.txt" is the result of a differential gene expression analysis between a treatment group and a control group. Genes are identified as differentially expressed when p < 0.01 and |log2 Fold Change| ≥ 1.

Here, gene_id is the gene name; control and treat are the average expression values of each gene in the two groups; log2FoldChange is the log2-transformed fold change in expression; pvalue is the significance p value of the difference; diff marks the differential genes screened by p < 0.01 and |log2 Fold Change| ≥ 1, where "up" is up-regulated, "down" is down-regulated, and "none" is non-differential.
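The labelling rule described above can be sketched in R as follows. This is only a minimal illustration on made-up values: the data frame `toy` and its contents are hypothetical, and only the column names and thresholds come from the description of the example file. It is not necessarily how "gene_diff.txt" was actually produced.

```r
# Toy data using the column names of gene_diff.txt (all values are made up)
toy <- data.frame(
  gene_id        = c("g1", "g2", "g3", "g4"),
  log2FoldChange = c(2.3, -1.8, 0.4, 1.5),
  pvalue         = c(0.001, 0.0005, 0.2, 0.03)
)

# Start with every gene labelled "none", then relabel genes that pass both thresholds
toy$diff <- "none"
sig <- toy$pvalue < 0.01 & abs(toy$log2FoldChange) >= 1
toy$diff[sig & toy$log2FoldChange > 0] <- "up"
toy$diff[sig & toy$log2FoldChange < 0] <- "down"

print(toy)
```

Note that g4 has a large fold change but stays "none" because its p value does not pass the 0.01 threshold; both conditions must hold.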

Next, the example file is used to show how to draw a "symmetric scatter diagram" of differential gene expression in R.

2 data preprocessing

First, do some preprocessing on the data.

For example, if the gene expression values span too many orders of magnitude, apply a logarithmic transformation. The genes are also sorted by whether they are differential genes, so that in the subsequent plot the points of significant genes are drawn on top and are not covered by the points of insignificant genes.

#Read sample data
express <- read.delim('gene_diff.txt', sep = '\t')
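#Apply a log2(x + 1) transformation to the gene expression values
#Base 2 is used so that the dashed lines offset by 1 in the plots correspond to |log2FC| = 1
#(only approximately for lowly expressed genes, because of the +1 pseudocount)
express$control <- log2(express$control + 1)
express$treat <- log2(express$treat + 1)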
#Sort so that 'none' rows come first and 'down'/'up' rows come last; points are drawn in row order, so the significant genes end up on top
express$diff <- factor(express$diff, levels = c('up', 'down', 'none'))
express <- express[order(express$diff, decreasing = TRUE), ]
head(express)  #View the data table after reading and preprocessing
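The effect of the sorting step can be checked on a small made-up table. This is a sketch under the assumption that the diff column contains only the three labels described earlier; the data frame `toy` is hypothetical.

```r
# Toy table with a mix of labels (values are made up)
toy <- data.frame(
  gene_id = c("g1", "g2", "g3", "g4"),
  diff    = c("up", "none", "down", "none")
)

# Same two steps as in the preprocessing: fix the level order, then sort decreasingly
toy$diff <- factor(toy$diff, levels = c("up", "down", "none"))
toy <- toy[order(toy$diff, decreasing = TRUE), ]

# 'none' rows now come first and 'up' rows last
print(as.character(toy$diff))
```

Because the factor levels are ordered up < down < none, sorting in decreasing order puts the non-differential genes at the top of the table, so they are drawn first and the differential genes are drawn over them.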

3 draw the scatter diagram of differential genes, with color representing the differential status

After that, the preprocessed data can be used for plotting.

The first approach colors the genes according to whether they are up-regulated, down-regulated or not significant, so that the differential genes can be identified directly from the plot. ggplot2 is used to draw the scatter diagram of differential genes.

#Draw the scatter plot, distinguishing up- and down-regulated genes by color
library(ggplot2)
ggplot(express, aes(x = control, y = treat)) +
  geom_point(aes(color = diff), size = 1) +  #Color the gene points by up/down/none status
  scale_color_manual(values = c('red', 'gray', 'green4'), limits = c('up', 'none', 'down')) +  #Colors assigned to up, none and down in this order
  theme_bw() +  #Background adjustment
  labs(x = 'control group', y = 'treat group', color = '') +  #Axis title settings
  #These three lines add the |log2FC| = 1 threshold lines (intercept 1 and -1) and the diagonal y = x
  #linewidth requires ggplot2 >= 3.4.0; older versions use size instead
  geom_abline(intercept = 1, slope = 1, color = 'black', linetype = 'dashed', linewidth = 0.5) +
  geom_abline(intercept = -1, slope = 1, color = 'black', linetype = 'dashed', linewidth = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = 'black', linetype = 'dashed', linewidth = 0.5)

The two coordinate axes represent the control group and the treatment group respectively, and each point shows the average expression value of one gene in the two groups (after log transformation). Relative to the control group, up-regulated genes are shown in red and down-regulated genes in green. The two outer dashed lines are the threshold lines for |log2FC| = 1, and the middle dashed line is the diagonal where expression is equal in the two groups.
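Why a dashed line with slope 1 and intercept 1 marks a two-fold change can be verified numerically. The sketch below assumes the plotted values are on a log2(x + 1) scale; the raw values are made up and chosen so that (treat + 1) / (control + 1) is exactly 2.

```r
# Made-up raw expression values; treat is chosen so that (treat + 1) / (control + 1) == 2
control_raw <- c(7, 31, 127)
treat_raw   <- 2 * (control_raw + 1) - 1

# Coordinates of the points on a log2(x + 1) scale
x <- log2(control_raw + 1)
y <- log2(treat_raw + 1)

# Vertical distance from the diagonal y = x: exactly 1 for a two-fold change
print(y - x)

# On a natural-log scale the same two-fold change gives an offset of log(2), not 1
print(log(treat_raw + 1) - log(control_raw + 1))
```

The offset between a point and the diagonal therefore equals the log fold change in whatever base was used for the transformation, so lines at intercepts 1 and -1 only mean |log2FC| = 1 when base 2 is used.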

In this figure, it is easy to see the overall distribution of the differential genes and to compare the numbers of up- and down-regulated genes.

4 draw the scatter diagram of differential genes, with color representing the p value

The p value information is not shown in the figure above. Another option is therefore to let the color represent the p value, which gives a color gradient in the plot. ggplot2 is used again; compared with the code above, only the color assignment differs.

#Scatter plot with a color gradient according to the p value
ggplot(express, aes(x = control, y = treat)) +
  geom_point(aes(color = pvalue), size = 0.8) +  #Color the gene points by the size of the p value
  scale_color_gradient2(low = 'red', mid = 'darkgoldenrod2', high = 'royalblue2', midpoint = 0.5) +  #Gradient color assignment
  theme_bw() +  #Background adjustment
  labs(x = 'control group', y = 'treat group', color = 'p-value') +  #Axis title settings
  #These three lines add the |log2FC| = 1 threshold lines (intercept 1 and -1) and the diagonal y = x
  #linewidth requires ggplot2 >= 3.4.0; older versions use size instead
  geom_abline(intercept = 1, slope = 1, color = 'black', linetype = 'dashed', linewidth = 0.5) +
  geom_abline(intercept = -1, slope = 1, color = 'black', linetype = 'dashed', linewidth = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = 'black', linetype = 'dashed', linewidth = 0.5)

As in the figure above, the two coordinate axes represent the control group and the treatment group respectively. Each point shows the average expression value of one gene in the two groups (after log transformation), the two outer dashed lines are the threshold lines for |log2FC| = 1, and the middle dashed line is the diagonal.

The difference from the figure above is that the genes are now colored by the significance p value, with the color grading from blue (not significant) to red (significant). This makes it easy to see that the greater the difference in expression between the two groups, the smaller the p value. The two trends are consistent, and the figure focuses on the relationship between the fold change and the p value.
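Because p values often span many orders of magnitude, one possible variation is to color by a -log10-transformed p value instead of the raw one, as is done on the vertical axis of a volcano plot. The following is only a sketch on made-up values; the vector names and the cap of 10 are illustrative assumptions, not part of the original workflow.

```r
# Made-up p values covering several orders of magnitude
pvalue <- c(1, 0.05, 0.01, 1e-5, 1e-20)

# -log10 transformation: larger values mean more significant
neg_log10_p <- -log10(pvalue)

# Optionally cap extreme values so that a few tiny p values do not dominate the gradient
# (the cap of 10 is an arbitrary illustrative choice)
neg_log10_p_capped <- pmin(neg_log10_p, 10)

print(round(neg_log10_p_capped, 2))
```

A column computed this way could then be mapped to color in place of pvalue, for example with aes(color = ...) and a two-color scale such as scale_color_gradient().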

Added by Velausanakha on Wed, 05 Jan 2022 15:41:37 +0200