Vol.2 Gene Set Sizes in Human and Mouse Database Resources

"Like the breeze running after the clouds looking at you from far to near"

About Author
Yifan Fu
Undergraduate, BUAA, Beijing

Major : Bioinformatics
Recent focus : single-cell transcriptomics, metagenomics, multi-omics interaction
Contact : fan@buaa.edu.cn

0x00 Preface

I have started yet another new series. These articles focus mainly on bioinformatics: they start from R and statistics, and I write while I learn, recording some of the problems I run into along the way.

This series assumes a basic understanding of R and bioinformatics. If you don't have that yet, you can search the public account "salmon roll" for related content under the topic "Research Toolbox". Of course, you can also chat with me directly!!!

0x01 The Question

Differences between Hypergeometric Distribution Test and GSEA

Usually you first obtain a list of up- and down-regulated differentially expressed genes, and annotating that list against the GO/KEGG databases then relies on the hypergeometric distribution test.

If instead we do not set a threshold to pick the up- and down-regulated genes, but sort all genes by some statistic (such as logFC) and then annotate against GO/KEGG, the analysis is generally GSEA.
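As a rough illustration of the two approaches, the sketch below runs a hypergeometric over-representation test with base R's `phyper` and then builds the kind of ranked gene list that GSEA takes as input. All counts, gene names, and logFC values here are made up for illustration and do not come from any real dataset.

```r
# Over-representation (hypergeometric) test; all counts are hypothetical.
universe <- 20000  # genes measured in the experiment
set.size <- 200    # genes annotated to one GO/KEGG term
de.size  <- 500    # genes passing the chosen differential-expression threshold
overlap  <- 15     # differential genes that belong to the term

# P(X >= overlap): chance of drawing at least `overlap` term genes when
# sampling `de.size` genes from the universe without replacement.
p.value <- phyper(overlap - 1, set.size, universe - set.size, de.size,
                  lower.tail = FALSE)
print(p.value)

# GSEA-style input: no threshold, every gene ranked by a statistic such as logFC.
logFC  <- c(GeneA = -1.2, GeneB = 0.3, GeneC = 2.5, GeneD = -0.1)  # hypothetical
ranked <- sort(logFC, decreasing = TRUE)
print(names(ranked))
```

The first part needs a cutoff to define `de.size` and `overlap`; the second part keeps every gene and only orders them, which is the practical difference between the two methods.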
But GO/KEGG are not the only databases.

Known gene sets are defined in MSigDB (the Molecular Signatures Database): http://software.broadinstitute.org/gsea/msigdb It includes eight collections, H and C1-C7:

H: hallmark gene sets: (cancer) hallmark gene sets, 50 in total; the most commonly used collection.
C1: positional gene sets: defined by chromosomal location, 326 in total; rarely used.
C2: curated gene sets: (expert-)curated gene sets based on pathways, the literature, etc.
C3: motif gene sets: mainly microRNA and transcription factor target genes.
C4: computational gene sets: defined by mining cancer-related microarray data.
C5: GO gene sets: Gene Ontology gene sets, in three parts: BP (biological process), CC (cellular component), and MF (molecular function).
C6: oncogenic signatures: cancer signature gene sets, mostly derived from microarray data published in NCBI GEO.
C7: immunologic signatures: immune-related gene sets.

As you can see, GO/KEGG is the most famous, but not the only one!

I found two files in WEHI's copy of MSigDB. After downloading them, the goal is to match up and compare the human and mouse gene sets:

http://bioinf.wehi.edu.au/software/MSigDB/human_H_v5p2.rdata
http://bioinf.wehi.edu.au/software/MSigDB/mouse_H_v5p2.rdata

Most human gene symbols are written entirely in uppercase, while the homologous mouse symbols mostly have only the first letter capitalized and the rest in lowercase. Case conversion alone, however, cannot fully convert between human and mouse genes. What is needed is a one-to-one gene correspondence table, so a different method is used for the conversion.
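To see where case conversion falls short, here is a minimal sketch in base R. The helper `to.mouse.case` is a hypothetical name introduced only for this example; it works for a symbol like GAPDH but misses the TP53/Trp53 pair.

```r
# Naive human -> mouse symbol conversion by case only (a sketch, not a real mapping).
to.mouse.case <- function(symbol) {
  paste0(toupper(substr(symbol, 1, 1)), tolower(substring(symbol, 2)))
}

human.symbols <- c("GAPDH", "TP53")
guessed <- to.mouse.case(human.symbols)
print(guessed)

# The mouse homolog of TP53 is Trp53, so the case-only guess does not match it.
print(guessed == c("Gapdh", "Trp53"))
```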

Goal: the differences between human and mouse for all 50 gene sets.

The headers of the result table are:

Number of genes in the human gene set
Number of genes in the mouse gene set
Number of genes shared by the two gene sets
Number of human-specific genes
Number of mouse-specific genes
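For a single gene set, these five numbers can be sketched with base R set operations. This assumes both inputs are already in the same nomenclature (for example, mouse symbols already converted to human symbols); `compare.sets` and the toy vectors are hypothetical names and data used only for illustration.

```r
# Summarise one pair of gene sets; both inputs are character vectors of
# symbols in the same nomenclature. Duplicates are dropped before counting.
compare.sets <- function(human, mouse) {
  human <- unique(human)
  mouse <- unique(mouse)
  c(human          = length(human),
    mouse          = length(mouse),
    overlap        = length(intersect(human, mouse)),
    human.specific = length(setdiff(human, mouse)),
    mouse.specific = length(setdiff(mouse, human)))
}

# Hypothetical toy sets, for illustration only.
toy.human <- c("G1", "G2", "G3", "G4")
toy.mouse <- c("G2", "G3", "G5")
print(compare.sets(toy.human, toy.mouse))
```

Because duplicates are removed first, the overlap plus the species-specific count always equals the size of that species' set.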

0x02 Solution

Current methods rely mainly on case conversion, regular-expression matching, and the like. However, there are cases such as TP53 (human) corresponding to Trp53 (mouse). To handle these, you need to go back to a public database and look up the matches. Here this is done with the R package biomaRt, reusing tools written in Vol.1.

load("human_H_v5p2.rdata") # loads the list Hs.H
load("mouse_H_v5p2.rdata") # loads the list Mm.H
if(!require(biomaRt))
  BiocManager::install("biomaRt")
library(biomaRt)
library(clusterProfiler)
library(org.Hs.eg.db)
library(org.Mm.eg.db)
h.gene <- lapply(Hs.H, function(x){
  bitr(x, # convert the Entrez IDs stored in the rdata file to gene symbols
       fromType = "ENTREZID",
       toType = "SYMBOL",
       OrgDb = org.Hs.eg.db)[,2]
})

m.gene <- lapply(Mm.H, function(x){
  bitr(x, 
       fromType = "ENTREZID",
       toType = "SYMBOL",
       OrgDb = org.Mm.eg.db)[,2]
})

identical(names(h.gene), names(m.gene)) # the gene set names are the same in both lists

human <- biomaRt::useMart("ensembl", "hsapiens_gene_ensembl")
mouse <- biomaRt::useMart("ensembl", "mmusculus_gene_ensembl")
# Map mouse symbols (MGI) to their human homologs (HGNC) through Ensembl
Ms2Hs <- function(gene){
  require(biomaRt)
  geneTrans <- getLDS(
    values = gene, mart = mouse,
    attributes = "mgi_symbol", filters = "mgi_symbol",
    attributesL = "hgnc_symbol",
    martL = human)
  hs.gene <- as.vector(geneTrans$HGNC.symbol)
  names(hs.gene) <- geneTrans$MGI.symbol # keep the source mouse symbol as the name
  return(hs.gene)
}

m2h.gene <- lapply(m.gene, Ms2Hs)
result <- data.frame("name" = names(h.gene))
result$m <- unname(lengths(m2h.gene)) # mouse genes after conversion to human symbols
result$h <- unname(lengths(h.gene))
# sum() over %in% counts matches directly and gives 0 (not NA) when there are none
for (i in seq_along(h.gene)) {
  m.in.h <- m2h.gene[[i]] %in% h.gene[[i]]
  h.in.m <- h.gene[[i]] %in% m2h.gene[[i]]
  result$m.op[i] <- sum(m.in.h)    # converted mouse genes found in the human set
  result$h.op[i] <- sum(h.in.m)    # human genes found in the converted mouse set
  result$m.diff[i] <- sum(!m.in.h) # mouse-specific genes
  result$h.diff[i] <- sum(!h.in.m) # human-specific genes
}


# Tip: why the counts do not add up as expected
m2h.gene[[1]][duplicated(m2h.gene[[1]])]

# Several mouse genes can correspond to the same human gene; such duplicates
# are counted more than once, which is why the totals differ.
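The effect of such duplicates can be reproduced on toy data. The sketch below uses made-up symbols (not real genes) in which two mouse genes map to the same human gene, and shows one possible way to make the counts agree by dropping duplicates first.

```r
# Hypothetical converted mouse set: names are mouse symbols, values are the
# human symbols they map to. Two mouse genes map to the same human gene.
toy.m2h <- c(Aaa1 = "GENE1", Aaa2 = "GENE1", Bbb = "GENE2")
toy.h   <- c("GENE1", "GENE2", "GENE3")

# Counting from the mouse side sees GENE1 twice; from the human side, once.
m.op <- sum(toy.m2h %in% toy.h)
h.op <- sum(toy.h %in% toy.m2h)
print(c(m.op = m.op, h.op = h.op))

# Dropping duplicated human symbols first makes both overlap counts agree.
m.op.unique <- sum(unique(toy.m2h) %in% toy.h)
print(c(m.op.unique = m.op.unique, h.op = h.op))
```

Whether to deduplicate depends on what you want to count: mouse genes that have a human counterpart in the set, or distinct human genes covered by the mouse set.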

0x03 Summary

Sleepy, still in the New Main Building, hungry. Trained chest today, and the hot pot smells really good.

Keywords: R Language Database Big Data

Added by auday1982 on Sun, 12 Sep 2021 19:23:48 +0300
