Shengxin step | transcriptome sequencing upstream analysis: hisat2+samtools+stringtie

Transcriptome analysis is widely used in the field of functional genomics. In this paper, hisat2+samtools+stringtie strategy is used to mine differentially expressed genes from transcriptome data. Here, I have sorted out the implementation process of this set of combination for future reference; At the same time, sharing on the platform, if you can help more beginners, Xiaobian will not be very honored. If there are mistakes, I also hope that all leaders can criticize and correct them.

Let's take a look at the functions performed by the software as a whole:
hisat2: establish reference genome index and compare reads
samtools: conversion of sam2bam
stringtie: estimate transcript expression

The data structure used is as follows. Here, yeast transcriptome data and reference genome are used:
The two terminal sequencing data has been filtered by fastqc, and the specific filtering process is not involved in this paper. The sample data is for reference only. During execution, we should pay attention to file format conversion and biological information contained in various formats.

@biocloud:~/1223/NGS2022$ tree
.
├── gene_data.csv//The sample of differentially expressed genes is equivalent to the standard answer of the final output.
├── genome
│   ├── yeast.fa//Reference genome file
│   ├── yeast.gff//Reference genome annotation file
│   └── yeast_transcriptome.fa // The transcriptome index file required by Kallisto is only used to realize the quantification of transcripts based on pseudo alignment, which is not involved in this paper.
├── reads//Double ended sequencing clean data, that is, it has undergone reads filtering. Usually, when getting off the plane, the company will give rawdata and cleandata at the same time. Just use the latter directly.
│   ├── s1_y_1.fq.gz
│   ├── s1_y_2.fq.gz
│   ├── s2_y_1.fq.gz
│   └── s2_y_2.fq.gz
├── script //Script file. If necessary, add an absolute path to reference the script.
│   ├── edgeR.R
│   ├── prepDE.py
│   ├── prepDE.py3
│   └── run.sh
└── src//The software installation package that may be used. If the server has installed the software, it does not need to be installed again.
    ├── fastqc_v0.11.9.zip
    ├── hisat2-2.2.1-Linux_x86_64.zip
    ├── hisat2-2.2.1-OSX_x86_64.zip
    ├── kallisto_linux-v0.46.1.tar.gz
    ├── kallisto_mac-v0.46.1.tar.gz
    ├── samtools-1.14.tar.bz2
    ├── stringtie-2.2.0.Linux_x86_64.tar.gz
    └── stringtie-2.2.0.OSX_x86_64.tar.gz

The complete execution code is as follows, which can be substituted line by line when running

$ cd ./genome
$ hisat2-build yeast.fa genome.fa //When constructing the reference genome index, there should be no space between hisat2 build.

//Perform comparison in the newly created folder
$ cd ..
$ mkdir alignment
$ cd alignment

$ hisat2 -p 6 -x ../genome/genome.fa -1 ../reads/s1_y_1.fq.gz -2 ../reads/s1_y_2.fq.gz -S ./s1.sam //The double ended sequencing reads obtained by sequencing were compared to the indexed reference genome using hisat2. It is double ended sequencing under the same treatment.

$ samtools view -bS s1.sam -o s1.bam //samtools view converts sam files into bam files. bam is the binary file of sam, which takes up less storage space after binary conversion.
$ samtools sort s1.bam s1.sorted //samtools sort will S1 BAM sorting, resulting in S1 sorted. bam.  The latter will be added automatically BAM suffix, no need to add in the command line.
$ samtools index s1.sorted.bam //Add the sorted bam file to the index

$ stringtie ./s1.sorted.bam -G ../genome/yeast.gff -e -p 2 -o ./s1_out.gtf -A ./s1_genes.list //stringtie estimated the transcript expression according to gtf annotation and bam comparison results after sequencing

//The comparison result of two terminal sequencing under another treatment was obtained by the same method: s2_out.gtf and s2_genes.list.  It will not be repeated here one by one.

//Perform summary expression operations in the newly created folder
$ cd ..
$ mkdir differential_expression
$ cd differential_expression

$ vim sample_list.txt #Manually create and edit sample_list.txt file, the content is two names + the address of the corresponding gtf file. Note that the second line does not have a newline symbol.
// sample_list.txt file looks like this. Here, two gtf files are placed in differential_ Under the expression folder.
// s1 /mnt/1223/NGS2021/differential_expression/s1_out.gtf
// s2 /mnt/1223/NGS2021/differential_expression/s2_out.gtf
$ python prepDE.py3 -i sample_list.txt -g gene_count.csv -t transcript.csv //The script of stringtie generates a list of differentially expressed genes.
// Note that the stringtie differential gene summary script has prepde Py and prepde There are two versions of PY3. The former is applicable to the python2 environment and the latter is applicable to the python3 environment. Mixed use will report an error. This article is based on Python 3 9.5 environment.

Gene obtained after performing the above steps_ count. CSV is the final result, which can be imported into R language for differential expression gene analysis with edgeR / DESeq2 package. Have the opportunity to reorganize the follow-up tutorials.

sam file Foundation

sam file is very important in sequence analysis, whether transcriptome analysis or genomic call SNP. Understanding the information contained in the sam file helps to understand the data. We still take the command line that generates the sam file in this article as an example. The core comparison step of transcriptome analysis is to use hisat2 to compare the sequenced reads to the indexed reference genome:

$ hisat2 -p 6 -x ../genome/genome.fa -1 ../reads/s1_y_1.fq.gz -2 ../reads/s1_y_2.fq.gz -S ./s1.sam
  • -x reference genome fa file
  • -1 double ended sequencing, paragraph 1 fq.gz format
  • -2 double ended sequencing, paragraph 2 fq.gz format
  • -S output file address + name, and the output result is in SAM format
    Double ended sequencing is generally read1 fq / read1. fq. gz / read1. fq. Bz2 format, both front and rear ends appear at the same time and are input as hisat2 at the same time. The sequenced fastq file was compared to the reference genome through hisat2 to obtain SAM file. SAM file is a sequence alignment file, which has upper and lower parts, including header comments and alignment results:
Header comment section
//The header comment section starts with @:
@HD     VN:1.0  SO:unsorted //HD line: VN version and comparison sort type. SO:unsorted shown here indicates that there is no sort order.
@SQ     SN:NC_001133.9  LN:230218//SQ line: reference sequence directory. SN: reference sequence name. LN: reference sequence length
@SQ     SN:NC_001148.4  LN:948066
@SQ     SN:NC_001224.1  LN:85779
@SQ     SN:NC_001140.6  LN:562643
@SQ     SN:NC_001141.2  LN:439888
@SQ     SN:NC_001142.9  LN:745751
@SQ     SN:NC_001143.9  LN:666816
@PG     ID:hisat2       PN:hisat2       VN:2.0.5        CL:"/mnt/bai/public/bin/hisat2-2.0.5/hisat2-align-s --wrapper basic-0 -p 8 -x ../genome/genome.fa --dta -S ./s1.sam -1 /tmp/1909596.inpipe1 -2 /tmp/1909596.inpipe2"
//PG line: the name of the comparison program used, here hisat2
Comparison results

Each line of the alignment result part represents the alignment result information of a read and the reference genome. The first 11 columns are the main columns and contain most important information.

SRR5511068.3    99      NC_001147.6     1002516 60      75M     =       1002624 184     GTACTTAACATTCTTCTAATCATGTTAAAAGGTAAAACCTGGCCCATTTTACGATCGATCTGTAAAATCTTATAC     AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEAEEEEEEEEEEEAEEEEEEEAEEEEEEEEEEA     AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:75 YS:i:0  YT:Z:CP NH:i:1
SRR5511068.3    147     NC_001147.6     1002624 60      76M     =       1002516 -184    GATAAGTAAGCAATGGTGGTAATTGCAATATTTTGCATATGTGCACGAGAAGAACTATTTTGAAGTAAGATCACTG    /EEEE<EAEEEEEE6E/EEEEEEEAEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEAAAAA    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:76 YS:i:0  YT:Z:CP NH:i:1
SRR5511068.9    83      NC_001146.8     299462  60      75M1S   =       299435  -103    ATTGGAAAAGAAAGTCGCGGCAAAGAGAAATGCCAATAAGACCGGGAATCAAAATTCTAAAAAGAAGAGTCAGAAG    <EAAA6<A<E/EEA/A</AE/EEEAA<EEAE6AEEAAEEAEEEAAAEE/EEEEEEEEEEEE//EEEEEE/EAAAAA    AS:i:-1 XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:75 YS:i:0  YT:Z:CP NH:i:1

The main part is separated by tab, from left to right:

  • QNAME: Query Name, sequencing reads name. If it is double ended sequencing, there will be a second time, and both ends will compare the last time.
  • FLAG: the number to the power of 2 or its sum. Each power of 2 represents a situation. For details, please refer to CSDN boss's blog: https://blog.csdn.net/genome_denovo/article/details/78712972
    Or Jane Book boss:
    https://www.jianshu.com/p/ab133ee9712c
  • RENAME: Reference Name, the name of the reference sequence on the double ended sequencing R1 alignment, which is *.
  • POS: position, the position sequence number on the initial comparison of double ended sequencing R1.
  • MAPQ: Mapping Quality, the higher the quality score, the more accurate it is.
  • CIGAR: comparison result, M stands for complete match.
  • RNEXT: the comparison of R2 end of double ended sequencing. If the comparison fails, the * sign is used, and the = sign is used for the same section.
  • MPOS: double ended sequencing, R2 end comparison position.
  • ISIZE: Library insertion length, R2 end position - R1 end position + CIGAR.
  • SEQ: sequence information.
  • QUAL: read quality, expressed in ASCII code.
In conclusion:

The steps of transcriptome data analysis are not only the comparison of differentially expressed genes, but also the significance analysis, correlation analysis, GO and KEGG analysis of differentially expressed genes. This paper only summarizes the upstream steps of transcriptome analysis. In terms of upstream steps, there are many kinds of comparison software available. This paper only uses the classic steps of hisat2+samtools+stringtie to learn the upstream analysis of transcriptome. Other comparison software shall be sorted out if involved in the future.

Keywords: Linux

Added by sqlmc on Wed, 12 Jan 2022 07:00:46 +0200