PCA of sample cohort grouping based on gene set

With the continuous decline of the price of ngs, the transcriptome sequencing project has long entered the era of large samples. Of course, there was such a trend as early as the chip price was close to the people. At present, the price of single-cell transcriptome is also following this old path.

Then, for the transcriptome of large sample cohort, there is often no known reasonable grouping. At this time, it will be artificially grouped to see the heterogeneity of the cohort, such as grouping according to the level of immunity.

There are many ways to implement this grouping according to the immune level. Let's briefly demonstrate the hierarchical clustering of PCA and heat map and the scoring grouping such as gsea or gsva to see if there is any difference.

First look at the PCA grouping of the target gene set

Need to load step1 output The expression matrix in rdata file, oh, if you don't know step1 output Rdata if you get it, look at the code at the end of the text.

First, we need to advance the small expression matrix of immune gene set, and the code is as follows:

load(file = 'step1-output.Rdata')
cg=c('CD3D','CD3G CD247','IFNG','IL2RG','IRF1','IRF4','LCK','OAS2,STAT1')
cg=cg[cg %in% rownames(dat)]
library("FactoMineR") #These two packages need to be loaded to draw the principal component analysis diagram
dat.pca <- PCA(as.data.frame(t(dat[cg,]))  )
             geom.ind = "point", # show points only (nbut not "text")
             # col.ind =  group_list, # color by groups 
             addEllipses = T,
             legend.title = "Groups"

As follows:

It can be seen that each sample has coordinates on this principal component analysis chart:

> head(dat.pca$ind$coord[,1:2])
                Dim.1      Dim.2
GSM1419942 -1.5307665  0.1820390
GSM1419943  0.2224343  0.9908672
GSM1419944 -2.1294537  0.3075360
GSM1419945  0.0745490  0.2147092
GSM1419946 -4.8646160 -0.2016098
GSM1419947  0.9149908  0.2421981

We can simply divide the sample into two groups according to PC1, or divide it into four groups according to PC1 and PC2. This step is very random. Let's try to divide the sample into two groups according to PC1. The code is as follows:

group_list=ifelse(dat.pca$ind$coord[,1] < 0 ,'neg','pos')
pca_gl = group_list
# Where hclust_gl comes from the previous tutorial

It can be seen that the sample grouping of the previous hierarchical clustering is still different from the grouping of PC1 of the current PCA:

> table(pca_gl,hclust_gl)
pca_gl high low
   neg    1  59
   pos   37  10

However, they are visually related, but there is not much difference:

Differences between the two groups

The naked eye basically can't see the difference. The difference should be those samples near abscissa 0!

The code is:

             geom.ind = "point", # show points only (nbut not "text")
             col.ind =  pca_gl, # color by groups 
             addEllipses = T,
             legend.title = "Groups"
               geom.ind = "point", # show points only (nbut not "text")
               col.ind =  hclust_gl, # color by groups 
               addEllipses = T,
               legend.title = "Groups"

It's very interesting. The samples with high and low immune classification in conflict between the two methods should have its special significance!

About step1 output Rdata this file

The above code loads step1 - output Rdata is a file. Here is how to make this file. The code is as follows:

rm(list = ls())  ## Magic operation, one key emptying~
options(stringsAsFactors = F)

gset <- geoChina("GSE58812")
a=gset[[1]] #
dat=exprs(a) #Now we need to know that expa is an object through the description of this function
dim(dat)#Look at the dimension of dat matrix
#   GPL570 [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array
dat[1:4,1:4] #Look at dat the 1 to 4 rows and 1 to 4 columns of this matrix. The row before the comma and the column after the comma
#By looking at the instruction manual, you can know to take the clinical information in object a and use pData
## Select some clinical phenotypes of interest.
group_list= ifelse(pd$`meta:ch1` == '0' ,'stable','meta')
ids=ids[ids$symbol != '',]
dat=dat[rownames(dat) %in% ids$ID,]

ids=ids[ids$probe_id %in%  rownames(dat),]

ids$median=apply(dat,1,median) #ids creates a new column called media. At the same time, it operates the dat matrix by row, takes the median of each row, and gives the result to each row of the column media
ids=ids[order(ids$symbol,ids$median,decreasing = T),]#Sort ids$symbol according to the order of ids$median from large to small, and assign the corresponding row as a new ids
ids=ids[!duplicated(ids$symbol),]#Take out the duplicate item from the symbol column, '!' If it is no, that is, take out the non duplicate items, remove the duplicate gene, and retain the maximum expression result of each gene s
dat=dat[ids$probe_id,] #Remove the probe from the new ids_ In the column ID, the dat is formed into a new dat according to each row in the extracted column
rownames(dat)=ids$symbol#Give each row in the symbol column of ids to dat as the row name of dat
dat[1:4,1:4]  #Keep the information of the first occurrence of each gene ID

     file = 'step1-output.Rdata')

If you really think my tutorial is helpful to your scientific research project and makes you enlightened, or your project uses a lot of my skills, please add a short thank you when publishing your achievements in the future, as shown below:

We thank Dr.Jianming Zeng(University of Macau), and all the members of his bioinformatics team, biotrainee, for generously sharing their experience and codes.

If I have traveled around the world in universities and research institutes (including Chinese mainland) in ten years, I will give priority to seeing you if you have such a friendship.

Added by Miri4413 on Thu, 03 Mar 2022 07:50:35 +0200