DAS_Tool: learning notes from scratch

Reference: https://github.com/cmks/DAS_Tool

DAS: dereplication, aggregation and scoring strategy

DAS Tool integrates the bin sets produced by different binning methods for the same metagenome into a single non-redundant set. The result contains more high-quality, near-complete bins and makes differences between microbial strains easier to see.

When running DAS Tool, supply the results of as many binning methods as possible. Even a method that yields only a few high-quality bins may recover bins that the other methods missed.

ABAWACA performs a hierarchical clustering on tetranucleotide frequencies and differential coverage, and takes marker genes into account. CONCOCT uses Gaussian mixture models and tetranucleotide frequencies with differential coverage. MaxBin 2 is based on an expectation-maximization algorithm and uses tetranucleotides, differential coverage and marker genes. MetaBAT applies a k-medoid clustering on tetranucleotide frequencies and differential coverage. (Quoted from "Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy", Nature Microbiology.)

The core idea of DAS Tool is to select bins iteratively, judging bin quality by a score based on single-copy genes.

Step 1: the input to DAS Tool consists of the scaffold sequences from the assembly (gray lines in the figure) and the bin sets produced by different binning tools (rounded rectangles of the same color are bins from the same binning method);

Step 2: single-copy genes are predicted on the scaffolds of each bin (blue shapes) and each bin is scored;

Step 3: the bins from all methods are pooled, with identical bins merged, to form the candidate set;

Step 4: iteratively select the highest-scoring bin and update the scores of the remaining candidate bins. If two bins have the same score, select the one with the higher scaffold N50. N50: the length of the shortest contig among the longest contigs that together cover 50% of the total assembly length.

Sequencing produces many reads, which are then assembled. A stretch that assembles without any gap is called a contig (contiguous). If there are gaps in the sequence but the gap lengths are known, the sequence is called a scaffold. To compute N50, sort the contigs (or scaffolds) from longest to shortest and add up their lengths in that order. For an assembly with a total length of 1 Mb, the length of the contig or scaffold at which the running sum first reaches 50% of the total, that is 500 kb, is the contig N50 or scaffold N50. In general, the larger this value, the better the assembly quality.
In other words, count down from the longest sequence until half of the total length is covered. The longer the last sequence counted, the more of the assembly sits in long sequences, and the better the quality of the final assembly.
Quoted from the CSDN blog post "What do N50 and N90 mean in genome sequencing" (Mr tomato egg's blog).
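
The N50 definition above can be sketched in a few lines of Python. This is a minimal illustration, not DAS Tool's own code; the function name n50 and the example lengths are made up for the example, chosen so that the total is 1 Mb as in the text.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    # Sort from longest to shortest, then accumulate until half the total is covered.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical assembly with a total length of 1,000,000 bp.
example_lengths = [400_000, 250_000, 150_000, 100_000, 60_000, 40_000]
print(n50(example_lengths))  # the running sum passes 500,000 at the 250,000 bp sequence
```

The same function gives the contig N50 or the scaffold N50, depending on which lengths are passed in.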

The final output consists of the non-redundant, high-scoring bins (score greater than the threshold t) selected from the different input bin sets.
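
The selection loop in steps 2 to 4 can be sketched as follows. This is a simplified illustration under stated assumptions, not DAS Tool's implementation: the scoring formula is only one plausible reading of the duplicate penalty (weight b) and megabin penalty (weight c) that appear in the option list, and all names (bin_score, select_bins, the scaffold and gene IDs, a reference set of 4 single-copy genes) are hypothetical.

```python
def n50(lengths):
    """N50 of a non-empty list of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def bin_score(scaffolds, scg_by_scaffold, n_reference_scgs, b=0.6, c=0.5):
    """Assumed score: reward distinct SCGs, penalise duplicated SCGs and extra copies."""
    counts = {}
    for scaffold in scaffolds:
        for gene in scg_by_scaffold.get(scaffold, ()):
            counts[gene] = counts.get(gene, 0) + 1
    unique = len(counts)
    if unique == 0:
        return 0.0
    duplicated = sum(1 for n in counts.values() if n > 1)
    extra_copies = sum(counts.values()) - unique
    return (unique / n_reference_scgs
            - b * duplicated / unique
            - c * extra_copies / n_reference_scgs)


def select_bins(candidates, scg_by_scaffold, lengths, n_reference_scgs, threshold=0.5):
    """Greedily pick the best bin, remove its scaffolds elsewhere, rescore, repeat."""
    remaining = {name: set(members) for name, members in candidates.items() if members}
    selected = {}

    def rank(name):
        members = remaining[name]
        score = bin_score(members, scg_by_scaffold, n_reference_scgs)
        # Ties on score are broken by the higher scaffold N50.
        return score, n50([lengths[s] for s in members])

    while remaining:
        best = max(remaining, key=rank)
        if rank(best)[0] <= threshold:
            break  # no remaining candidate scores above the threshold
        chosen = remaining.pop(best)
        selected[best] = chosen
        # Scaffolds already assigned are removed from the other candidates;
        # candidates that become empty are dropped.
        remaining = {name: members - chosen
                     for name, members in remaining.items() if members - chosen}
    return selected


# Hypothetical data: two binning methods (A and B), five scaffolds, four reference SCGs.
scg_by_scaffold = {"s1": ["g1", "g2"], "s2": ["g3"], "s3": ["g4"],
                   "s4": ["g1", "g2", "g3"], "s5": ["g4"]}
lengths = {"s1": 50_000, "s2": 30_000, "s3": 20_000, "s4": 80_000, "s5": 10_000}
candidates = {"A.1": {"s1", "s2", "s3"}, "A.2": {"s4", "s5"},
              "B.1": {"s1", "s2"}, "B.2": {"s3", "s5"}}

selected = select_bins(candidates, scg_by_scaffold, lengths, n_reference_scgs=4)
print(list(selected))
```

In this toy input, A.1 and A.2 start with the same score, so the N50 tie-break decides which is taken first; B.1 and B.2 lose all their scaffolds to the selected bins and never make it into the output.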

CheckM first builds a phylogenetic tree from completely sequenced bacterial reference genomes and constructs a set of single-copy genes (SCGs) for each lineage (which can loosely be thought of as a species). Why single-copy genes? Because each is expected to occur exactly once per genome, so missing or extra copies reveal how incomplete or how mixed (contaminated) a genome is. When run, CheckM places each bin in the tree together with the reference genomes, finds the bin's reference lineage from the phylogenetic relationships, and then computes two key metrics using that lineage's single-copy gene set. Completeness: how many of the lineage's SCGs are present in the bin, in the range [0, 100%]; the larger the value, the better the bin. Contamination: the extent to which the bin contains SCGs from more than one species, that is, how mixed the bin is, in the range [0, 100%]; the smaller the value, the better the bin.
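
To make the two metrics concrete, here is a deliberately simplified calculation. It is not CheckM's actual algorithm, which works with lineage-specific marker sets; it only counts markers one by one. The function name and the marker IDs m1 to m10 are hypothetical.

```python
from collections import Counter


def completeness_and_contamination(found_markers, reference_markers):
    """Simplified estimates, in percent, from single-copy marker counts.

    found_markers: marker IDs detected in the bin, one entry per copy found.
    reference_markers: marker IDs expected exactly once in the reference lineage.
    """
    reference = set(reference_markers)
    counts = Counter(m for m in found_markers if m in reference)
    # Completeness: share of expected markers seen at least once.
    completeness = 100.0 * len(counts) / len(reference)
    # Contamination: copies beyond the first, relative to the number of expected markers.
    contamination = 100.0 * sum(n - 1 for n in counts.values()) / len(reference)
    return completeness, contamination


# Hypothetical bin: 8 of 10 expected markers found, 2 of them in two copies.
reference_markers = [f"m{i}" for i in range(1, 11)]
found_markers = [f"m{i}" for i in range(1, 9)] + ["m1", "m2"]
print(completeness_and_contamination(found_markers, reference_markers))
```

A marker that should appear once but appears twice suggests that sequence from a second genome has ended up in the bin, which is why extra copies are counted towards contamination.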

Practical usage

DAS_Tool -i methodA.scaffolds2bin,...,methodN.scaffolds2bin
         -l methodA,...,methodN -c contigs.fa -o myOutput

   -i, --bins                 Comma separated list of tab separated scaffolds to bin tables.
   -c, --contigs              Contigs in fasta format.
   -o, --outputbasename       Basename of output files.
   -l, --labels               Comma separated list of binning prediction names. (optional)
   --search_engine            Engine used for single copy gene identification [blast/diamond/usearch].
                              (default: usearch)
   --write_bin_evals          Write evaluation for each input bin set [0/1]. (default: 1)
   --create_plots             Create binning performance plots [0/1]. (default: 1)
   --write_bins               Export bins as fasta files  [0/1]. (default: 0)
   --proteins                 Predicted proteins in prodigal fasta format (>scaffoldID_geneNo).
                              Gene prediction step will be skipped if given. (optional)
   --score_threshold          Score threshold until selection algorithm will keep selecting bins [0..1].
                              (default: 0.5)
   --duplicate_penalty        Penalty for duplicate single copy genes per bin (weight b).
                              Only change if you know what you're doing. [0..3]
                              (default: 0.6)
   --megabin_penalty          Penalty for megabins (weight c). Only change if you know what you're doing. [0..3]
                              (default: 0.5)
   --db_directory             Directory of single copy gene database. (default: install_dir/db)
   --resume                   Use existing predicted single copy gene files from a previous run [0/1]. (default: 0)
   --debug                    Write debug information to log file.
   -t, --threads              Number of threads to use. (default: 1)
   -v, --version              Print version number and exit.
   -h, --help                 Show this message.

-i the bin sets produced by the different binning methods. Each file is a scaffolds2bin table: one line per scaffold, with the scaffold ID and the bin ID separated by a tab.

Scaffold_1	bin.01
Scaffold_8	bin.01
Scaffold_42	bin.02
Scaffold_49	bin.03
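
A scaffolds2bin table like the one above is easy to read back for a quick sanity check before running DAS Tool. The following is a small sketch using only the Python standard library; the function name read_scaffolds2bin is made up for the example.

```python
import csv
import io
from collections import defaultdict


def read_scaffolds2bin(handle):
    """Read a tab-separated scaffolds2bin table into {bin_id: [scaffold_id, ...]}."""
    bins = defaultdict(list)
    for line_number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"line {line_number}: expected 2 tab-separated fields, got {len(row)}")
        scaffold_id, bin_id = row
        bins[bin_id].append(scaffold_id)
    return dict(bins)


# The example table from above, held in memory instead of a file.
example = io.StringIO(
    "Scaffold_1\tbin.01\n"
    "Scaffold_8\tbin.01\n"
    "Scaffold_42\tbin.02\n"
    "Scaffold_49\tbin.03\n"
)
bins = read_scaffolds2bin(example)
for bin_id, scaffolds in bins.items():
    print(bin_id, len(scaffolds))
```

For a real file, pass an open file handle instead, for example open("methodA.scaffolds2bin", newline="").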

-l names of the binning methods, in the same order as the -i input files

-c FASTA file of the assembled contigs

-o basename (prefix) of the output files, which include DASTool_summary.txt (the output bins with estimates of their quality and completeness) and DASTool_scaffolds2bin.txt (the output bins and their corresponding scaffolds)

--search_engine

The search engine used to identify single-copy genes. The default is usearch; blast and diamond are also available

--write_bin_evals

Write an evaluation of each input bin set

--write_bins

Export the bins as FASTA files

--proteins

Proteins predicted by Prodigal, in FASTA format. If given, the gene prediction step is skipped

--score_threshold

Score threshold for bin selection: the algorithm keeps selecting bins until the score reaches this value (default 0.5)
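
When DAS Tool is called from a pipeline, the command line can be assembled in Python. The sketch below only builds the argument list from the options documented above and does not run anything; the helper name build_das_tool_command and the file names are hypothetical.

```python
def build_das_tool_command(bin_tables, contigs, output_basename,
                           score_threshold=None, threads=None, write_bins=False):
    """Build a DAS_Tool argument list.

    bin_tables maps a binning method label to its scaffolds2bin file; the
    -i and -l lists are generated in the same order so they stay paired.
    """
    labels = list(bin_tables)
    command = [
        "DAS_Tool",
        "-i", ",".join(bin_tables[label] for label in labels),
        "-l", ",".join(labels),
        "-c", contigs,
        "-o", output_basename,
    ]
    if score_threshold is not None:
        command += ["--score_threshold", str(score_threshold)]
    if threads is not None:
        command += ["--threads", str(threads)]
    if write_bins:
        command += ["--write_bins", "1"]
    return command


# Hypothetical input files from two binning methods.
command = build_das_tool_command(
    {"metabat": "metabat.scaffolds2bin", "maxbin": "maxbin.scaffolds2bin"},
    contigs="contigs.fa",
    output_basename="myOutput",
    score_threshold=0.5,
    threads=4,
    write_bins=True,
)
print(" ".join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) on a machine where DAS_Tool is installed. Building -i and -l from one mapping avoids the easy mistake of listing labels in a different order from the files.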


Added by binumathew on Mon, 03 Jan 2022 21:42:44 +0200