Detailed explanation of sparse matrix in R language

Tweet before Detailed explanation of python scipy sparse matrix The principle of common sparse matrix and its application experience in python are introduced in detail. This tweet focuses on R language sparse matrix format and its application in single cell multiomics (scRNA, scATAC).

R sparse matrix

dgTMatrix

That is, Sparse matrices in triplet form, which is similar to coo in python_ Matrix is the simplest storage method of sparse matrix. It uses the form of triple (row, col, data) (or ijv format) to store the information of non-zero elements in the matrix.

dgCMatrix

Compressed, sparse, column oriented numeric matrices, which is similar to CSC in Python_ Matrix is a sparse matrix format compressed by column, which is composed of three one-dimensional arrays indptr, indices and data. dgCMatrxi is the standard format and the most common format in R-Pack matrix. This format is generally used for sparse data storage of single cell multiomics.

dgRMatrix

That is, sparse compressed, row oriented numeric matrices, which is similar to CSR in Python_ Matrix, contrary to dgCMatrxi, the sparse matrix of this format is compressed by rows. In principle, it is also composed of three one-dimensional arrays indptr, indices and data. dgRMatrxi sparse matrix is generally not used in single-cell omics analysis. The implementation of rmdxi is similar to that of rmdxi.

Common operation methods of R sparse matrix

Sparse matrix construction

######dgCMatrix######
Method 1: Matrix(*, sparse = TRUE)
#Matrix package
mat <- Matrix(data = 0L, nrow=3201, ncol = 818, sparse = TRUE) #dgCMatrix
print(object.size(mat), unit="KB")
#4.7 Kb
dim(mat)
#[1] 3201  818

Method 2: sparseMatrix()
#Matrix package
i <- c(1,3:8)
j <- c(2,9,6:10)
x <- 7 * (1:7)
mat2 <- sparseMatrix(i, j, x = x) # dgCMatrix

Method 3: as(*, "dgCMatrix")
mat3 <- matrix(data=c(0,0,1,2,0,3,0,0,0), nrow=3, ncol=3) #Matrix built-in function, which is used to build a general matrix.
mat3 <- as(mat3, "dgCMatrix") # dgCMatrix
#3 x 3 sparse Matrix of class "dgCMatrix"      
#[1,] . 2 .
#[2,] . . .
#[3,] 1 3 .

Method 4: new("dgCMatrix", ...)，...Fill in at dgCMatrix Class
#Although R is a statistical language, R can also be object-oriented programming. S4 object is a standard object-oriented implementation, with special class definition function setClass() and class instantiation function new(). "dgCMatrix" is essentially a class, so we can directly instantiate this class with the new function to create objects.
indptr <-  c(2L, 0L, 2L) #Must be an integer
indices <- c(0L, 1L, 3L, 3L) #Must be an integer
data <- c(1, 2, 3)
mat4 <- new('dgCMatrix', i=indptr, p=indices, x=data, Dim=c(3L,3L))  #The Dim parameter must be an integer



######dgTMatrix######
Method 1: sparseMatrix()
#Matrix package
i <- c(1,3:8)
j <- c(2,9,6:10)
x <- 7 * (1:7)
mat1 <- sparseMatrix(i, j, x = x, repr = "T") 

Method 2: spMatrix()
#Matrix package
mat2 <- spMatrix(10,20, i = c(1,3:8),
                     j = c(2,9,6:10),
                     x = 7 * (1:7))

Method 3: as(*, "dgTMatrix") #It's the same as building dgC matrix, so I won't repeat it

Method 4: new("dgTMatrix", ...) #It's the same as building dgC matrix, so I won't repeat it

Sparse matrix format conversion

#dense matrix to sparse matrix
sparse_matrix <- as(dense_matrix, "dgCMatrix")
sparse_matrix <- as(dense_matrix, "dgTMatrix")
sparse_matrix <- as(dense_matrix, 'sparseMatrix')  # dgCMatrix
sparse_matrix <- as(dense_matrix, 'Matrix')  # dgCMatrix

#sparse matrix to dense matrix
dense_matrix <- as.matrix(sparse_matrix)
dense_matrix <- as(sparse_matrix, 'matrix')

#dgCMatrix to dgTMatrix
dgC_matrix <- as(dgT_matrix, 'dgCMatrix')
dgT_matrix <- as(dgC_matrix, 'dgTMatrix')

In fact, there is an R package SparseM Package can easily operate sparse matrix. The name of sparse matrix in this R package is the same as that in python, that is, csc, csr and coo sparse matrix. Interested readers can understand it by themselves.

Sparse matrix judgment

Sometimes it is necessary to judge whether the matrix is sparse and what kind of sparse matrix it is. You can use the following command:

i <- c(1,3:8)
j <- c(2,9,6:10)
x <- 7 * (1:7)
mat <- sparseMatrix(i, j, x = x, repr = "T") # Create dgTMatrix
is(mat, 'sparseMatrix') 
#[1] TRUE
is(mat, 'dgTMatrix') 
#[1] TRUE
is(mat, 'dgCMatrix')
#[1] FALSE
is(mat, 'Matrix') 
#[1] TRUE
is(mat, 'matrix') #It should be noted that the matrix package is usually used to operate sparse matrices, while the built-in function matrix is used to operate conventional matrices.
#[1] FALSE

Sparse matrix reading and writing

#Generally, the reading and saving of sparse Matrix is realized through Matrix package.
Matrix::readMM #Read sparse matrix
Matrix::writeMM #Save sparse matrix

Application of sparse matrix in single cell multiomics

Cellranger

The following is the result generated by Cellranger, where matrix MTX files are in dgTMatrix format, so if the downstream analysis software requires sparse matrices in other formats, conversion steps are required.

$ tree filtered_feature_bc_matrix/
filtered_feature_bc_matrix/
├── barcodes.tsv.gz
├── features.tsv.gz  
└── matrix.mtx.gz  dgT format

Seurat

One big difference between using Seurat and Scanpy to analyze single-cell data is that each behavior in the Annadata object is one cell, while the Seurat object is the opposite. dgCMatrix is used for single cell sparse data storage in Seurat; The sparse storage mode of Cellranger output to file is dgTMatrix format, so it is necessary to convert the sparse matrix format before analyzing the data output by Cellranger with Seurat. This is the core implementation of Seurat::Read10X function, which will generate dgCMatrix with row and column names. Of course, you can also use this function to build a sparse matrix by yourself, and then use the CreateSeuratObject function to generate Seurat objects. Reference codes are as follows:

## construct RNA Seurat object
dir.10x = './filtered_feature_bc_matrix'
genes = read.delim(paste0(dir.10x, 'genes.tsv'), row.names = 1)
barcodes = read.delim(paste0(dir.10x, 'barcodes.tsv'), row.names = 1)
mtx = Matrix::readMM(paste0(dir.10x, 'matrix.mtx'))
mtx = as.matrix(mtx)
mtx = as(mtx, 'dgCMatrix') #Convert the sparse matrix to Seurat's default dgC format
colnames(mtx) = row.names(barcodes)
rownames(mtx) = row.names(genes)
pbmc = CreateSeuratObject(counts = mtx, assay = 'RNA', meta.data = barcodes)

Signac

For the import of Cellranger data, you can use Seurat::Read10X and Seurat::Read10X_h5 function. However, when we download data from a public database, in many cases, we can only download three files generated by cell Ranger ATAC: barcodes tsv,matrix.mtx and peaks bed. In this case, we can either forcibly try the Seurat::Read10X function or implement a reading method ourselves.

If you directly use Seurat::Read10X to read these three files, you will find an error in the code for two reasons:

The file name is incorrect. Because the three file names read in by Seurat::Read10X by default are barcodes tsv. gz,features.tsv.gz and matrix mtx. GZ, while the file names output by cell Ranger ATAC are different.
The Seurat::Read10X function needs to read non duplicate barcodes and features(gene name or peak ID), but peaks Bed does not provide a unique peak ID.

Therefore, method 1, modify peaks The bed file content, and then read it using the Seurat::Read10X function

# peaks. The bed file format does not have a unique peak ID, so it needs to be converted.
chr1	9757	10652
chr1	86735	87645
chr1	180601	181169

#To peaks Add peak ID to bed file, and the construction method is the same as above; Then modify the file name; Finally, read with Read10X function.
data.dir <- './filtered_peak_bc_matrix'
features <- read.csv('./filtered_peak_bc_matrix/peaks.bed',  header = FALSE, row.names = NULL, sep = '\t')
row.names(features) <- paste(paste(features$V1, features$V2, sep=':'), features$V3, sep='-')   #Build peak ID
features <- features['V1']  #Take out the peak ID column separately and save it
write.table(features, './filtered_peak_bc_matrix/features.tsv', sep = '\t', row.names=F, col.names=F, quote = F)
#Next, modify other file names, that is, they can be read by the Read10X function.
counts <- Read10X(data.dir,gene.column = 1, cell.column = 1)
chrom_assay <- CreateChromatinAssay(counts = counts, 
                                    sep = c(":", "-"), genome = 'hg19',
  																	# fragments = '../vignette_data/atac_v1_pbmc_10k_fragments.tsv.gz', #You can't provide this file without it, so you can't draw a track diagram later
  																	min.cells = 10,
  																	min.features = 200)
pbmc <- CreateSeuratObject(counts = chrom_assay, assay = "peaks")

Method 2: read the data by yourself and build the object

#Read10X_ The h5 function essentially generates a dgCMatrix with row and column names, so if we don't have an h5 file, we can directly build this matrix ourselves.
dir.10x = './filtered_feature_bc_matrix'
features = read.csv(paste0(dir.10x, 'peaks.bed'), header = FALSE, row.names = NULL, sep = '\t') 
row.names(features) <- paste(paste(features$V1, features$V2, sep=':'), features$V3, sep='-')  #Build peak ID
barcodes = read.delim(paste0(dir.10x, 'barcodes.tsv'), row.names = 1)
mtx = Matrix::readMM(paste0(dir.10x, 'matrix.mtx'))
colnames(mtx) = row.names(barcodes)
mtx = as(mtx, 'dgCMatrix') #The sparse matrix generated here is equivalent to Read10X_h5 generated sparse matrix, use this object to continue downstream analysis.
chrom_assay <- CreateChromatinAssay(counts = mtx, 
                                    sep = c(":", "-"), genome = 'hg19',
  																	# fragments = '../vignette_data/atac_v1_pbmc_10k_fragments.tsv.gz', #If there is no such file, it can not be provided, so the fragments related analysis cannot be performed later
  																	min.cells = 10,
  																	min.features = 200)
pbmc <- CreateSeuratObject(counts = chrom_assay, assay = "peaks")