Common file formats in bioinformatics

Bioinformatics relies on many analysis tools, and they use a wide variety of file formats. The following is a brief summary of some common formats, for quick reference.

Sequence information

The most basic need is to store DNA, RNA or protein sequences. The most common formats for this are FASTA and FASTQ. The meanings of the letters used in sequences are covered in another article of mine.

FASTA

FASTA is commonly used to store sequence information. Each record consists of two parts. The first is the header line, which starts with > and usually gives the name or ID of the sequence; it must fit on a single line. The lines that follow hold the sequence itself, which is often wrapped across multiple lines. For example:

>Escherichia-coli-MG1655
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
...
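
As a rough illustration of the structure described above, here is a minimal Python sketch that reads FASTA records into a dictionary. The function name parse_fasta and the two short records are made up for this example; real files are usually read with an established library rather than hand-written code.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence}, joining wrapped sequence lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line  # append one wrapped sequence line
    return records


# Hypothetical records, made up for illustration.
example = """>seq1 hypothetical record
AGCTTTTCATTCTGACTGCA
ACGGGCAATATG
>seq2
TTGACA
"""
records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq))
```

Keeping the whole header as the key is a simplification; many tools treat only the text before the first space as the sequence ID.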

FASTQ

FASTQ stores not only the sequence but also the quality of each base, and is commonly used for raw sequencing reads. Each record has 4 lines:

  • The first line is similar to the FASTA header, but starts with @, followed by the sequence ID or name;
  • The second line is the sequence, which in practice occupies a single line;
  • The third line starts with a + character, optionally followed by the same ID as the first line;
  • The fourth line is the quality string, which has the same length as the second line. The ASCII code of each character is converted into the quality score of the corresponding base by subtracting an offset. The offset depends on the platform: 33 (Phred+33) is the most common today, while some older Illumina pipelines used 64.

For example:

@002a8f7c-c04b-4da1-bda3-70dbcd0f255c ch=245 start_time=2021-08-10T12:13:14
TGCTTCCTGT...
+
&$$#&%$)/+...

The sequences stored in these two formats can be nucleotide sequences or amino acid sequences. Nucleotide sequences are written in the 5' -> 3' direction.
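
To make the quality line of the FASTQ example concrete, the sketch below decodes its visible characters, assuming the Phred+33 encoding (an assumption, since the offset depends on the platform). The helper names are hypothetical. A Phred score Q corresponds to an error probability of 10^(-Q/10).

```python
from typing import List


def phred33_to_scores(quality: str) -> List[int]:
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(q: int) -> float:
    """Convert a Phred score Q into an error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


# The visible part of the sequence and quality lines from the example record.
seq = "TGCTTCCTGT"
qual = "&$$#&%$)/+"
scores = phred33_to_scores(qual)
for base, q in zip(seq, scores):
    print(base, q, round(error_probability(q), 3))
```

Scores this low are typical of raw long reads; a score of 10 means roughly one error in ten bases.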

reference material:

  • https://www.jianshu.com/p/5bd5848eb596

Alignment information

Sequence information alone is not enough. Sequences are often aligned to one another, and the alignment yields information about which positions correspond.

PAF

PAF (Pairwise mApping Format) is a text format for storing sequence alignment (mapping) results. Each line represents one record and must have at least 12 fields, separated by tabs (\t):

  1. Query sequence name;
  2. Query sequence length;
  3. Query start: start position of the alignment on the query (0-based);
  4. Query end: end position of the alignment on the query (0-based, exclusive);
  5. Relative strand: "+" or "-";
  6. Target sequence name, i.e. the name of the reference sequence;
  7. Target sequence length, i.e. the length of the reference sequence;
  8. Target start: start position of the alignment on the reference sequence (0-based);
  9. Target end: end position of the alignment on the reference sequence (0-based, exclusive);
  10. Number of residue matches: the number of bases that actually match;
  11. Alignment block length: total number of bases in the alignment, including matches, mismatches, insertions and deletions;
  12. Mapping quality: 0-255, the larger the better, except that 255 means the value is missing.

Some tools may also write a record for a query that could not be aligned, in which case the alignment fields 3-11 are *. After the 12 mandatory fields there may be optional tag fields in the same style as SAM, such as ch:i:12. For example:

3fbcc430-7501-4f61-84d5-d22242801f6c7 95 39 95 + Escherichia coli mg1655.fasta 4641652 234646 234707 40 62 255 ch:i:12
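
The 12 mandatory columns and the optional tags can be read with a few lines of Python. This is only a sketch: the PafRecord type, the parse_paf_line function and the sample record are all hypothetical, and records that carry * in numeric columns are not handled.

```python
from typing import Dict, NamedTuple, Union


class PafRecord(NamedTuple):
    qname: str
    qlen: int
    qstart: int
    qend: int
    strand: str
    tname: str
    tlen: int
    tstart: int
    tend: int
    matches: int
    block_len: int
    mapq: int
    tags: Dict[str, Union[int, float, str]]


def parse_paf_line(line: str) -> PafRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("a PAF record needs at least 12 tab-separated fields")
    tags: Dict[str, Union[int, float, str]] = {}
    for item in fields[12:]:
        tag, typ, value = item.split(":", 2)  # TAG:TYPE:VALUE, as in SAM
        if typ == "i":
            tags[tag] = int(value)
        elif typ == "f":
            tags[tag] = float(value)
        else:
            tags[tag] = value
    return PafRecord(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]), fields[4],
        fields[5], int(fields[6]), int(fields[7]), int(fields[8]),
        int(fields[9]), int(fields[10]), int(fields[11]), tags,
    )


# Hypothetical record, made up for illustration.
line = "read1\t100\t39\t95\t+\tref1\t5000\t234\t290\t40\t62\t60\tch:i:12"
rec = parse_paf_line(line)
identity = rec.matches / rec.block_len  # column 10 divided by column 11
print(rec.tname, rec.tstart, rec.tend, round(identity, 3))
```

Dividing column 10 by column 11 gives a simple estimate of alignment identity, which is only meaningful when the record was produced from a real base-level alignment.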

reference material:

  • https://github.com/lh3/miniasm/blob/master/PAF.md

SAM/BAM/CRAM

The Sequence Alignment Map (SAM) format stores sequence alignment information as text. A file consists of an optional header section and an alignment section. Header lines start with @, and there may be several of them, each holding one kind of information. The kind is given by a two-letter code; for example, SQ describes a reference sequence. The rest of the line is a series of tab-separated KEY:VALUE pairs; for example, LN:18957 means that the sequence length is 18957. A complete header line looks like @SQ SN:KM034562.G3686.1 LN:18957.

Each line of the alignment section represents one alignment, and a single read may have several. Every line has the following 11 mandatory fields, separated by tabs (\t):

  1. QNAME String Query template NAME
  2. FLAG Int bitwise FLAG
  3. RNAME String Reference sequence NAME
  4. POS Int 1-based leftmost mapping position on Reference, ref start
  5. MAPQ Int Mapping Quality
  6. CIGAR String CIGAR string
  7. RNEXT String Ref. name of the mate/next read
  8. PNEXT Int Position of the mate/next read
  9. TLEN Int observed Template LENgth
  10. SEQ String query sequence
  11. QUAL String ASCII of Phred-scaled base quality + 33, i.e. the per-base sequence quality

If a read is unaligned, RNAME in the third column is * and the position fields are 0. The 11 mandatory fields may be followed by optional tags that record additional information, in the format TAG:TYPE:VALUE. For example, NM:i:1 means that the edit distance to the reference is 1.

Binary Alignment Map (BAM) is the compressed binary version of SAM. CRAM compresses alignments against an external reference sequence, so CRAM files are usually smaller than BAM files, but the reference sequence file generally has to be specified for both compression and decompression.
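
Two of the mandatory fields, FLAG and CIGAR, are compact encodings that usually need decoding. The sketch below shows one way to do this in plain Python; the function names and the sample alignment line are hypothetical, while the bit meanings and CIGAR operators follow the SAM specification.

```python
import re
from typing import List, Tuple

# Bit meanings of the FLAG field, as listed in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operator) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_end(pos: int, cigar: str) -> int:
    """1-based inclusive end on the reference; M, D, N, = and X consume reference bases."""
    span = sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")
    return pos + span - 1


# Hypothetical alignment line, made up for illustration.
line = "read1\t16\tref1\t101\t60\t5S20M2D10M1I4M\t*\t0\t0\t" + "A" * 40 + "\t*\tNM:i:3"
fields = line.split("\t")
qname, flag, rname = fields[0], int(fields[1]), fields[2]
pos, cigar = int(fields[3]), fields[5]
print(qname, decode_flag(flag), rname, pos, reference_end(pos, cigar))
```

In practice these details are usually left to a library built on htslib, but decoding them by hand once makes the fields easier to read.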

reference material:

  • https://samtools.github.io/hts-specs/
  • https://samtools.github.io/hts-specs/SAMv1.pdf
  • https://www.jianshu.com/p/f0f1f293f0bd

Genome annotation information

BED

A Browser Extensible Data (BED) file stores genomic regions and their annotations. It can be uploaded to the UCSC Genome Browser to display the corresponding regions.

The file may begin with optional header lines, possibly several, each starting with browser or track and followed by display settings for the Genome Browser.

After that, each line describes one genomic region, with fields separated by tabs (\t) or spaces. The first 3 fields are required:

  1. chrom: name of the chromosome, such as chr3;
  2. chromStart: start position of the region on the chromosome, 0-based;
  3. chromEnd: end position of the region on the chromosome, not included in the region.

Then there are 9 optional fields:

  4. name: the name of the region;
  5. score: a score of 0-1000;
  6. strand: '+' for the forward strand, '-' for the reverse strand, '.' if no strand applies;
  7. thickStart: start position of the thickly drawn part;
  8. thickEnd: end position of the thickly drawn part;
  9. itemRgb: display color, as an R,G,B value;
  10. blockCount: number of blocks (such as exons) in the region;
  11. blockSizes: the length of each block, separated by commas;
  12. blockStarts: the start position of each block, relative to chromStart, separated by commas.

Example:

track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
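
The block fields are the least obvious part of a 12-column BED line, so here is a small sketch that turns them into absolute coordinates. The function name bed12_blocks is hypothetical, and the code assumes that no field contains embedded spaces. It is applied to the cloneA line from the example.

```python
from typing import List, Tuple


def bed12_blocks(line: str) -> List[Tuple[str, int, int]]:
    """Return each block as (chrom, start, end) in 0-based, end-exclusive coordinates."""
    fields = line.split()
    chrom = fields[0]
    chrom_start = int(fields[1])
    block_count = int(fields[9])
    # The lists may end with a trailing comma, so empty items are dropped.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    # blockStarts are offsets relative to chromStart.
    return [(chrom, chrom_start + s, chrom_start + s + size)
            for size, s in zip(sizes, starts)]


line = "chr22 1000 5000 cloneA 960 + 1000 5000 0 2 567,488, 0,3512"
blocks = bed12_blocks(line)
for block in blocks:
    print(block)
```

The end of the last block coincides with chromEnd, which is a useful sanity check when writing BED files by hand.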

There are also several common variants of BED files:

  • bigBed is a binary format created from a BED file, and is better suited to large data sets;
  • bedDetail is essentially the same as BED, with two extra fields: an ID and a detailed description.

reference material:

  • http://genome.cse.ucsc.edu/FAQ/FAQformat.html#format1

bedMethyl

bedMethyl is a variant of the BED format dedicated to reporting methylation status. The first 9 fields are the same as in BED:

  1. chrom: name of the chromosome, such as chr3;
  2. chromStart: start position of the region on the chromosome, 0-based;
  3. chromEnd: end position of the region on the chromosome, not included in the region;
  4. name: the name of the region;
  5. score: a score of 0-1000, shown as a gray level in the Genome Browser;
  6. strand: '+' for the forward strand, '-' for the reverse strand, '.' if no strand applies;
  7. thickStart: start position of the thickly drawn part;
  8. thickEnd: end position of the thickly drawn part;
  9. itemRgb: display color.

The last two fields carry the methylation information:

  10. Number of reads covering the position (coverage);
  11. Percentage of those reads that are methylated at this position.

reference material:

  • https://www.encodeproject.org/data-standards/wgbs/

GFF

The General Feature Format (GFF) is similar to BED and is also used to store genomic regions and their annotations. When a file is prepared for the Genome Browser, the optional header lines start with browser or track and contain display settings.

After that, each line describes one feature, with fields separated by tabs (\t). There are 9 required fields:

  1. seqname: name of the chromosome or sequence;
  2. source: the program that generated the annotation;
  3. feature: type of the feature, such as enhancer or promoter;
  4. start: start position of the feature on the chromosome, 1-based;
  5. end: end position of the feature on the chromosome, 1-based and included in the feature;
  6. score: a score of 0-1000, shown as a gray level in the Genome Browser; '.' means there is no score;
  7. strand: '+' for the forward strand, '-' for the reverse strand, '.' if no strand applies;
  8. frame: for a coding exon, a value of 0-2 giving the reading frame of the first base; for other feature types, '.';
  9. group: group name.

Example:

browser position chr22:10000000-10025000
browser hide all
track name=regulatory description="TeleGene(tm) Regulatory Regions" visibility=2
chr22	TeleGene	enhancer	10000000	10001000	500	+	.	touch1
chr22	TeleGene	promoter	10010000	10010100	900	+	.	touch1
chr22	TeleGene	promoter	10020000	10025000	800	-	.	touch2
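
Because GFF positions are 1-based and inclusive while BED positions are 0-based and end-exclusive, converting between them means shifting only the start. The sketch below converts one GFF line to a 6-column BED line; the function name gff_to_bed and the choice of the group field as the BED name are assumptions made for illustration.

```python
def gff_to_bed(line: str) -> str:
    """Convert one 9-column GFF line into a 6-column BED line."""
    (seqname, source, feature, start, end,
     score, strand, frame, group) = line.rstrip("\n").split("\t")
    bed_start = int(start) - 1       # 1-based inclusive -> 0-based
    bed_end = int(end)               # inclusive end equals the exclusive end
    bed_score = "0" if score == "." else score  # BED has no '.' score
    return "\t".join([seqname, str(bed_start), str(bed_end),
                      group, bed_score, strand])


gff_line = "chr22\tTeleGene\tenhancer\t10000000\t10001000\t500\t+\t.\ttouch1"
bed_line = gff_to_bed(gff_line)
print(bed_line)
```

The feature covers the same 1001 bases in both representations: 10001000 - 10000000 + 1 in GFF terms, and 10001000 - 9999999 in BED terms.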

reference material:

  • http://genome.cse.ucsc.edu/FAQ/FAQformat.html#format3
  • http://gmod.org/wiki/GFF2

Wiggle/WIG

Formats such as BED annotate the type and function of genomic regions, whereas Wiggle (WIG) files attach a series of numeric values to regions of the genome, such as probability scores. They can also be loaded and displayed by the UCSC Genome Browser.

Wiggle is a text format. Each section begins with a track definition line, which starts with track type=wiggle_0 to declare the type and is followed by options that set the name, display style and other properties.

The data section that follows comes in two formats. The first is variableStep, used for regions whose start positions are irregularly spaced. Its declaration line has the form variableStep chrom=chrN [span=windowSize], giving the format, the chromosome and the window size (optional, 1 by default). It is followed by multiple data lines with two columns each, chromStart dataValue: the start position of a region and its value. For example:

track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
49304701 10.0
49304901 12.5
49305401 15.0
49305601 17.5
49305901 20.0
49306081 17.5
49306301 15.0
49306691 12.5
49307871 10.0

The above gives values for 9 genomic intervals of 150 bases each. The first interval is 49304701-49304850, and its value is 10.0.
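
To see how the declaration line and the data lines combine, the following sketch expands variableStep data into explicit intervals. The function name expand_variable_step is hypothetical, and the intervals are reported in Wiggle's own 1-based, end-inclusive coordinates.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int, float]


def expand_variable_step(lines: Iterable[str]) -> List[Interval]:
    """Expand variableStep data into (chrom, start, end, value), 1-based inclusive."""
    chrom, span = None, 1
    intervals: List[Interval] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue  # skip blank, track, browser and comment lines
        if line.startswith("variableStep"):
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))  # span defaults to 1
            continue
        start_text, value_text = line.split()
        start = int(start_text)
        intervals.append((chrom, start, start + span - 1, float(value_text)))
    return intervals


wig = """variableStep chrom=chr19 span=150
49304701 10.0
49304901 12.5
"""
intervals = expand_variable_step(wig.splitlines())
for interval in intervals:
    print(interval)
```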

The other format is fixedStep, used for regions whose start positions are evenly spaced. Its declaration line has the form fixedStep chrom=chrN start=position step=stepInterval [span=windowSize], giving the format, the chromosome, the first start position, the distance between consecutive start positions, and the window size (optional, 1 by default). It is followed by multiple data lines with a single number each, the value of the corresponding region. For example:

track type=wiggle_0 name="fixedStep" description="fixedStep format" visibility=full autoScale=off viewLimits=0:1000 color=0,200,100 maxHeightPixels=100:50:20 graphType=points priority=20
fixedStep chrom=chr19 start=49307401 step=300 span=200
1000
 900
 800
 700
 600
 500
 400
 300
 200
 100

The above describes 10 regions on chromosome 19, each spanning 200 bases. The first region starts at base 49307401, and each subsequent region starts 300 bases after the previous one. The 10 values are assigned to these regions in order.
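
The same expansion for fixedStep data only needs a running start position. This is again a sketch with a hypothetical function name, expand_fixed_step, reporting 1-based, end-inclusive intervals.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int, float]


def expand_fixed_step(lines: Iterable[str]) -> List[Interval]:
    """Expand fixedStep data into (chrom, start, end, value), 1-based inclusive."""
    chrom, position, step, span = None, 0, 1, 1
    intervals: List[Interval] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue  # skip blank, track, browser and comment lines
        if line.startswith("fixedStep"):
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            position = int(params["start"])
            step = int(params["step"])
            span = int(params.get("span", 1))  # span defaults to 1
            continue
        intervals.append((chrom, position, position + span - 1, float(line)))
        position += step  # the next region starts one step further on
    return intervals


wig = """fixedStep chrom=chr19 start=49307401 step=300 span=200
1000
 900
 800
"""
intervals = expand_fixed_step(wig.splitlines())
for interval in intervals:
    print(interval)
```

With step=300 and span=200, consecutive regions are separated by a 100-base gap that carries no value.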

Note that the base positions in the wiggle file are all 1-based, that is, the position of the first base is 1, and the region includes the base at the end position.

bigWig is the compressed binary Wiggle file.

reference material:

  • https://genome.ucsc.edu/goldenpath/help/wiggle.html

BedGraph

The BedGraph format is similar to Wiggle. It also attaches a series of values to regions of the genome and can be loaded and displayed by the UCSC Genome Browser.

The first line is again a track definition line, which starts with track type=bedGraph to declare the type, followed by options that set the name, display mode and so on.

In the data section, each line gives the value for one genomic region in four fields: chrN chrStart chrEnd dataValue. The first three fields are the same as in a BED file, namely the chromosome name, start position and end position. The last field is the value for the region. For example:

browser position chr19:49302001-49304701
browser hide all
browser pack refGene encodeRegions
browser full altGraph
track type=bedGraph name="BedGraph Format" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20
chr19 49302000 49302300 -1.0
chr19 49302300 49302600 -0.75
chr19 49302600 49302900 -0.50
chr19 49302900 49303200 -0.25
chr19 49303200 49303500 0.0
chr19 49303500 49303800 0.25
chr19 49303800 49304100 0.50
chr19 49304100 49304400 0.75
chr19 49304400 49304700 1.00

The above assigns values to 9 regions on chromosome 19. Positions in a BedGraph file are 0-based: the first base of a chromosome is at position 0, and a region does not include the base at its end position.

BedGraph files can also be converted into binary bigWig files, but converting between BedGraph and Wiggle is less straightforward.
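
Since BedGraph intervals are end-exclusive, the length of a region is simply chrEnd minus chrStart. The sketch below uses this to compute a length-weighted mean over the data lines; the function name weighted_mean is hypothetical, and the three data lines are taken from the example.

```python
from typing import Iterable, Tuple


def weighted_mean(lines: Iterable[str]) -> Tuple[int, float]:
    """Return (total bases covered, length-weighted mean value) for BedGraph data."""
    total_bases = 0
    weighted_sum = 0.0
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue  # skip blank, track, browser and comment lines
        chrom, start, end, value = line.split()
        length = int(end) - int(start)  # 0-based, end-exclusive interval
        total_bases += length
        weighted_sum += length * float(value)
    if total_bases == 0:
        raise ValueError("no data lines found")
    return total_bases, weighted_sum / total_bases


bedgraph = """chr19 49302000 49302300 -1.0
chr19 49302300 49302600 -0.75
chr19 49302600 49302900 -0.50
"""
total, mean = weighted_mean(bedgraph.splitlines())
print(total, mean)
```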

reference material:

  • https://genome.ucsc.edu/goldenPath/help/bedgraph.html

Gene variation

VCF

The Variant Call Format (VCF) records the differences between sequencing results and a reference genome, that is, the variants.

If the file is to be loaded and displayed by the UCSC Genome Browser, the first line must be a track definition line, which starts with track type=vcf to declare the type, followed by options that set the name, display mode and so on.

After that comes the meta-information section, which may span multiple lines, all beginning with ##, recording information such as the file format version, date and reference genome.

Then comes the header line, which starts with # and names the columns of the data. The mandatory columns are #CHROM POS ID REF ALT QUAL FILTER INFO. Each following line records one variant, with values matching the column names:

  1. CHROM: chromosome ID;
  2. POS: position of the variant, i.e. the position of the first base of REF (column 4);
  3. ID: the ID of the variant if it is recorded in a variant database; otherwise '.';
  4. REF: base sequence of the reference at the variant position;
  5. ALT: base sequence observed in the sequencing result at the variant position;
  6. QUAL: quality score of the variant call;
  7. FILTER: filter status of the variant. PASS means it passed all filters; any other value names the filters it failed;
  8. INFO: additional information.

Example:

track type=vcf name="vcf example" description="three samples in a vcf" db=hg18 visibility="full"
browser position chr20:1-1306000
##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20	14370	rs6054257	G	A	29	PASS	NS=3;DP=14;AF=0.5;DB;H2	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
20	17330	.	T	A	3	q10	NS=3;DP=11;AF=0.017	GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3
20	1110696	rs6040355	A	G,T	67	PASS	NS=2;DP=10;AF=0.333,0.667;AA=T;DB	GT:GQ:DP:HQ	1|2:21:6:23,27	2|1:2:0:18,2	2/2:35:4
20	1230237	.	T	.	47	PASS	NS=3;DP=13;AA=T	GT:GQ:DP:HQ	0|0:54:7:56,60	0|0:48:4:51,51	0/0:61:2
20	1234567	microsat1	GTC	G,GTCT	50	PASS	NS=3;DP=9;AA=G	GT:GQ:DP	0/1:35:4	0/2:17:2	1/1:40:3

VCF files use many abbreviations. Please refer to the official documentation for their detailed meanings.
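
As a rough guide to how the columns fit together, the sketch below parses the first data line of the example into a dictionary. The function names parse_info and parse_vcf_line are hypothetical, values are kept as strings except for POS and QUAL, and the meta-information lines are ignored, so the declared types of INFO and FORMAT keys are not applied.

```python
from typing import Dict, List, Union


def parse_info(info: str) -> Dict[str, Union[str, bool]]:
    """Split the INFO column into key/value pairs; flags without '=' become True."""
    result: Dict[str, Union[str, bool]] = {}
    if info == ".":
        return result
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            result[key] = value
        else:
            result[item] = True
    return result


def parse_vcf_line(line: str, sample_names: List[str]) -> dict:
    """Parse one tab-separated VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    record = {
        "chrom": fields[0],
        "pos": int(fields[1]),
        "id": None if fields[2] == "." else fields[2],
        "ref": fields[3],
        "alt": [] if fields[4] == "." else fields[4].split(","),
        "qual": None if fields[5] == "." else float(fields[5]),
        "filter": fields[6],
        "info": parse_info(fields[7]),
        "samples": {},
    }
    if len(fields) > 9:
        keys = fields[8].split(":")  # the FORMAT column names the per-sample keys
        for name, value in zip(sample_names, fields[9:]):
            record["samples"][name] = dict(zip(keys, value.split(":")))
    return record


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\tNA00003"
sample_names = header.split("\t")[9:]
line = ("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\t"
        "GT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.")
record = parse_vcf_line(line, sample_names)
print(record["chrom"], record["pos"], record["ref"], record["alt"], record["info"])
```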

reference material:

  • https://www.jianshu.com/p/34c1e22c92c8

Keywords: Bioinformatics

Added by Hagbard on Mon, 22 Nov 2021 15:12:55 +0200