Viromics data analysis-03 Virsorter2 virus sequence recognition

VirSorter2 uses multiple classifiers and expert guided methods to detect different DNA and RNA virus genomes. It has made major updates to its previous version (virsorter):

  • Combined with more viruses, including dsDNA phage, ssDNA virus, RNA virus, NCLDV (nuclear cell virus), lavaviridae (virophages);

  • Applying machine learning to estimate virulence using genomic features, including structural / functional / taxonomic annotations and viral marker genes;

  • Use high-quality viral genomes from metagenomes or other sources for training.

install

Based on conda

conda create -n vs2 -c conda-forge -c bioconda virsorter=2
conda activate vs2
#It takes about 10min to install the database
virsorter setup -d db -j 10

#If the installation fails, it can also be installed manually, https://osf.io/v46sc/download
#After decompression, pass the path to -- DB dir
tar -xzf db.tgz
virsorter config --init-source --db-dir=./db

function

virsorter run -w test.out -i test.fa --min-length 1500 -j 4 all

#Re run the program quickly based on different min scores (only run the classify step), and add the suffix (label) to the new file as rerun
virsorter run -w test.out -i test.fa --include-groups "dsDNAphage,ssDNA" -j 4 --min-score 0.9 --label rerun classify

#Increase the number of hmmscan threads to improve the running speed
virsorter config --set HMMSEARCH_THREADS=4

Interpretation of results

  • final-viral-combined.fa

The identified virus sequence includes two types: - the complete sequence determined as the virus (with the suffix | full)- Some sequences are identified as viruses (with suffix | {i}_partial); Here, {i} can be a number from 0 to the maximum number of virus fragments found in the overlap group- Short (less than two genes) sequence, and its marker gene is identified as virus (with suffix | lt2gene);

  • final-viral-score.tsv

This table can be used to further filter the results. It includes the following columns: - sequence name - score of each virus sequence across groups (multiple columns) - maximum score across groups - maximum score group - overlap length - marker gene count - percentage of virus genes - percentage of non virus genes

Note: the virus sequences of classifiers are not exclusive to each other, and their target virus sequences may overlap, which means that this information should not be used or regarded as a reliable taxonomic classification. The localization of VirSorter2 is limited to virus recognition.

  • final-viral-boundary.tsv

Only some columns in this file may be useful: - seqname: original sequence name - trim_orf_index_start,trim_orf_index_end: start and end of the original sequence of the identified virus sequence ORF index - trim_bp_start,trim_bp_end: the beginning and end of the original sequence of the identified virus sequence - trim_pr: the score of the final pruned virus sequence - partial: the whole sequence is a virus or part of the sequence is a virus; When the complete sequence has a score > score cutoff, it is complete (0), or any virus sequence extracted therein is partial (1) - pr_full: the score of the original sequence - hallmark_cnt: marker gene count - group: the classifier of virus group with high score; This should not be used as a reliable classification

Note: VirSorter2 sometimes tends to overestimate the size of the virus sequence during virus pre extraction to achieve better sensitivity. It is recommended to clean up these previral predictions to remove potential host genes at the edge of the predicted virus region, such as using the CheckV tool

Training custom classifiers

see Bitbucket

proposal

1. How to choose the dividing line?

A: Generally speaking, people with a score > 0.9 have high confidence. Those scoring between 0.5 and 0.9 may be a mixture of virus and non virus. It is difficult to find the best score to distinguish between virus and non virus because it depends on the percentage of host sequence and unknown sequence. Therefore, it is recommended to use the default cut-off value (0.5) to obtain maximum sensitivity, and then use checkV to apply quality check steps to eliminate false positives (except prediction integrity).

2. How to judge whether the identified virus sequence is the original virus?

A: Only part of the virus sequence (ending with _partial) can be identified as the original virus. The complete virus sequence in VirSorter2 (ending with | full) is defined as an overlapping group with significant virus signal (score > = 0.8) as a complete sequence. Therefore, some may also be protoviruses: 1) they may be fragments inside the protovirus; 2) Although there are some host genes at the end, the whole sequence has a strong viral signal.

3. Why prune the complete virus sequence (ending with | full)?

A: in three cases, the complete virus sequence can be trimmed.

  • 1) VirSorter2 is based on the gene called by prodigal. Prodigal ignores some base protrusions other than the first and last genes, and VirSorter2 is also ignored by default;
  • 2) Circular sequences usually split in the middle of a gene and have repeated fragments. VirSorter2 trims the repeat fragment and repairs the split gene by moving part of the gene from start to end.
  • 3) A complete virus sequence only means that the entire sequence has a significant virus signal (default score > = 0.95), but VirSorter2 still applies the end pruning step (10% of the gene at each end) to find the best virus fragment (the longest is within 95% of the peak score by default). Similarly, the "complete" sequence pruned by the end pruning step should not be interpreted as the original virus, because genes that have less impact on the score, such as unknown genes or genes shared by the host and the virus, can be pruned. If you want to not trim the complete sequence (ending with | full) and leave it to special tools such as checkV, you can use the -- keep original SEQ option.

run help documentation

-w, --working-dir PATH        Output directory
  -d, --db-dir PATH             Database directory
  -i, --seqfile PATH            fa or fq Sequence file in format (available through gzip or bz2 compress
  -l, --label TEXT              Add labels as prefixes to the output file;
                                This is useful when rerunning classifications with different filtering options
  --include-groups TEXT         Classifiers including virus groups (comma separated, no spaces in the middle);
                                The available options are: dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae 
                                [Default: dsDNAphage,ssDNA]
  -j, --jobs INTEGER            Maximum number of jobs allowed in parallel. [Default: 160]
  --min-score FLOAT             The lowest score identified as a virus [Default: 0.5]
  --min-length INTEGER          The shortest sequence required;All sequences shorter than this value will be deleted 
                                [Default: 0]
  --keep-original-seq           Keep the original sequence instead of trimming;
                                By default, untranslated regions at both ends of the identified virus sequence are trimmed;
                                The cyclic sequence is modified to eliminate the overlap between the two ends,
                                And adjust the genes that split into two ends [Default: no]
  --exclude-lt2gene             Short sequences (less than 2 genes) are not scored,
                                But those sequences with signature genes are viruses by default;
                                Use this option to exclude them [Default: no]
  --prep-for-dramv              by DRAMv Procedure preparation document[Default: no]
  --high-confidence-only        Only high confidence virus sequences are output;
                                This is equivalent to screening the final virus with the following criteria 
                                score.tsv Medium:( max_score > = 0.9)or
                                (max_score > = 0.7 also hallmark Marker gene> = 1)[Default: no]
  --hallmark-required           All virus sequences require marker genes[Default: no]
  --hallmark-required-on-short  It is necessary to have marker genes on the short virus sequence
                                (The length cut-off is"short"from template-config.yaml
                                In the file"MIN_SIZE_ALLOWED_WO_HALLMARK_GENE"set up,
                                The default is 3 kbp);This can reduce false positives at a reasonable sensitivity cost[Default: no]
  --viral-gene-required         It is necessary to annotate the virus gene to remove the hypothetical virus without annotated gene
                                sequence; This can reduce false positives at a reasonable sensitivity cost [Default: no]
  --viral-gene-enrich-off       Turning off the viral gene that invokes the full sequence of the virus is more demanding than the cellular gene;
                                When using only VirSorter2 Generate virus sequences with recognition from other tools
                                of DRAMv When entering, or when using checkV When pruning the virus sequence,
                                This is very useful[Default: no]
  --seqname-suffix-off          Turn off adding suffixes to sequence names(||full, ||{i}_partial,
                                ||lt2gene)  Please note that this may result in data from the same overlap group
                                Some sequences have the same name;When you determine the maximum value in each overlap group
                                You can use this option when there is a partial sequence[Default: no] 
  --provirus-off                After classifying the complete overlap, the extracted protophage was turned off.
                                and--max-orf-per-seq Use together, It can significantly speed up the operation 
                                [Default: no]
  --max-orf-per-seq INTEGER     Method for calculating classification features ORF Maximum number of;
                                This option can only be used in --provirus-off Use in mode;
                                If in the sequence ORF Exceeding the maximum limit,
                                Then its value is sampled here to reduce the calculation [Default: -1]
                                Can set ORF Quantity threshold
  --tmpdir TEXT                 Directory name of intermediate file
  --rm-tmpdir                   Delete intermediate file directory(--tmpdir); 
                                Each run creates more than 100 intermediate files, so,
                                It is recommended to use this file when processing more than 100 samples,
                                To avoid exceeding the user's file size[Default: no]
  --verbose                     Show detailed rule output[Default: no]
  --profile TEXT                snakemake Configuration files, for example, for cluster execution.
                                If directly added--forceall Option can be used to force a rerun
  -n, --dryrun                  Check the rules to run and the files to generate[Default: no]
  --use-conda-off               Stop using the included package conda envs (vs2.yaml),
                                And use the content installed in the current system;Only if you want to use what you like
                                Only useful when version installs dependencies on its own;This option applies to development versions [Default: no]
  -h, --help                    Display this message and exit
  
  all                           By default, this means running the entire pipeline, including 1) preprocessing, 2) annotation (feature extraction), and 3) classification.
                                The main computational bottleneck is the annotation step, which takes about 95% of CPU Time.
                                If you only want to use different score cutoff values (--min-score) Re run,
                                Then use classify Parameter can skip the annotation step and rerun only the classification step

reference:

GitHub - jiarong/VirSorter2: customizable pipeline to identify viral sequences from (meta)genomic data

Keywords: Data Analysis Data Mining

Added by tetecko81sk on Mon, 21 Feb 2022 16:23:16 +0200