Using VCF files¶
Short description of the format¶
The Variant Call Format is designed to store information relative to genome-wide
diversity data. The format consists in a header containing meta-information
(in lines prefixed by ##
) followed by a single header providing the list of
samples included in the file, and by the body of the file which consists in,
typically, a very large number of lines each describing variation for a given
variant (a variant can be a single nucleotide polymorphism, an insertion/deletion,
a microsatellite, or any form of genomic variation, including large rearrangements.
An example appears below:
##fileformat=VCFv4.4
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
Pieces of information are attached to each variant (site) and, within a variant,
to each sample. The former are denoted INFO
and the latter FORMAT
. In the
example above, an example of INFO
field is NS
(whose value is 3 for the first
site), and an exemple of FORMAT
field is GT
(whose value for the samples of
the first sites are: 0|0
, 1|0
, and 1|1
).
The description of the VCF format is available here.
Reading VCF files¶
EggLib provides two alternative parsers in the io module:
VcfParser
and VCF
.
The former is essentially there as a fallback solution in case the,
latter, which depends on the C library htslib
, is not available. Refer
to the installation Install for installation. If the dependency
is fulfilled at installation, the VCF
class will be available.
If not, attempting to use it will cause an exception.
It can be tested using a variable exposed in EggLib:
>>> print(egglib.config.htslib)
1
(This will return 0 or 1.). It can be also tested from the command line:
$ EggLib version 3.2.0
Installation path: /home/stephane/.local/lib/python3.9/site-packages/egglib/wrappers
External application configuration file: /home/stephane/.config/EggLib/apps.conf
debug flag: 0
htslib flag: 1
version of muscle: 5
The VCF
class offers a number of advantages:
It is based on htslib, the underlying library of the
samtools
andbcftools
programs, making it the de facto standard for parsing VCF/BCF files.VcfParser
is based on a native implementation which can differ occasionally (often by being more restrictive and complaining about the format).It can import both compressed and uncompressed VCF and BCF files. With
VcfParser
, the user is required to provide uncompressed VCF file, which can be a huge bottleneck.It is expected to be significantly more efficient, especially for direct reading of BCF data.
Using default parser VCF
¶
Opening a file¶
To open a file with the VCF
class, pass the name of a
compressed or uncompressed VCF or BCF file as in:
>>> vcf = egglib.io.VCF('example.vcf')
>>> print(vcf.get_samples())
['NA00001', 'NA00002', 'NA00003']
Immediately after opening the file, no data has been accessed; all
accessors will return None
(except header data):
>>> print(vcf.get_pos())
None
Iterating on positions¶
The next position (or variant) is read using the read()
method, which returns a boolean. If the boolean if True
, data has
been read and can be accessed. If (and only if) the end of file is
reached, read()
returns False
. To loop over the whole content
of the file, just write:
>>> while vcf.read():
... print(vcf.get_chrom(), vcf.get_pos())
...
20 14369
20 17329
20 1110695
20 1230236
20 1234566
Indexing¶
Indexing allows arbitrary and linear-time navigation within BCF files.
(not available for VCF files). Index files generated by bcftools
are
supported, while the function io.index_vcf()
can be used to
generate a BCF index.
To demonstrate the use of indexes, we will use a BCF file which we will index before importing it:
>>> egglib.io.index_vcf('data.bcf')
>>> vcf = egglib.io.VCF('data.bcf')
>>> print(vcf.has_index)
True
The index file is named after the BCF file (with a “.csi” suffix). By
default, index_vcf()
and VCF
use the same format. If
the index is named differently (e.g. located in a different directory),
its name can be specified as the index option of the VCF
constructor:
>>> egglib.io.index_vcf('data.bcf', outname='another_name.csi')
>>> vcf = egglib.io.VCF('data.bcf', index='another_name.csi')
>>> print(vcf.has_index)
True
Extracting data¶
There a number of accessors allowing to extract data from the current position or variant.
To get the dictionary of all INFO
fields attached to the current
position, one can use get_infos()
, and
get_info()
to get a specific field:
>>> print(vcf.get_infos())
{'AA': 'A', 'TRI': [1.0, 2.0], 'ALT': 'C,G,T', 'GOOD': True}
>>> print(vcf.get_info('AA'))
A
To get the values of all FORMAT
fields for all samples, the method
get_formats()
can be used. It returns a list
(one item per sample) of dict
which all share the same set of
keys. The following gives an example which might betray the lack of
imagination of the author of the test file:
>>> print(vcf.get_formats())
[{'TEST5': '.', 'TEST1': 702}, {'TEST5': 'nothing', 'TEST1': 703}, {'TEST5': 'not more', 'TEST1': 704}, {'TEST5': 'something!', 'TEST1': 705}]
Another crucial accessor method is get_genotypes()
, which
returns a list
of genotypes. In this list
, each
sample is represented by a list
of alleles, based on its
ploidy. To transfer this structure to an EggLib’s Site
, one
must flatten the list and subsequently generate a Structure
object to analyse the object with its ploidy with the stats module:
>>> print(vcf.is_snp())
True
>>> genotypes = vcf.get_genotypes()
[['A', 'A', 'A'], ['A', 'A', 'A'], ['A', 'A', 'C'], ['A', 'C', 'C']]
>>> print(genotypes)
>>> site = egglib.site_from_list([j for i in genotypes for j in i],
... alphabet = egglib.alphabets.DNA)
>>> struct = egglib.struct_from_samplesizes([4], ploidy=3)
The code uses the is_snp()
method to check if the current
site is a proper SNP, guaranteeing that the DNA
can
be used. Next the site is extracted and converted to a Site
object. The list comprehension with two for
statements
([j for i in genotypes for j in i]
) is the way to flatten a sequence
is Python. The last line is a reminder how a Structure
object
with known sample size and known (and constant) ploidy can be created.
Using the fallback parser VcfParser
¶
Opening a file¶
Assuming the example VCF file above has been saved in an uncompressed
file named example.vcf
, you need to provide the class’s constructor
with the name of the file. As a result, only the meta-information
present in the header and the list of samples will be known to the
instance at this point. The property
num_samples
and the method
get_sample()
let you get the list of sample
names:
>>> vcf = egglib.io.VcfParser('example.vcf')
>>> print([vcf.get_sample(i) for i in range(vcf.num_samples)])
['NA00001', 'NA00002', 'NA00003']
The meta-information properties attached to the file can be accessed using the same model as the sample names (one property and one getter method taking an index), as listed below for the different categories of meta-information:
Code |
Type of meta-information |
Counter property |
Accessor method |
---|---|---|---|
|
Alternative allele code |
||
|
Test used to filter files |
||
|
Descriptor of sample data |
||
|
Descriptor of variant data |
||
|
Other meta-information |
The last category, META
, represents all meta-information lines with a custom key (other
than ALT
, FILTER
, FORMAT
, and INFO
). To collect all user-defined
META
entries as a dictionary, use the following expression:
>>> meta = dict([vcf.get_meta(i) for i in range(vcf.num_meta)])
>>> print(meta)
{'fileDate': '20090805', 'source': 'myImputationProgramV3.1', 'reference': 'file:///seq/referen
ces/1000GenomesPilot-NCBI36.fasta', 'contig': '<ID=20,length=62435964,assembly=B36,md5=f126cdf8
a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>', 'phasing': 'partial'}
Reading variants¶
Due to the potentially large size of VCF files, the VCF parser follows
an iterative scheme where lines are read one after another, only
keeping the current one in memory. When iterating over a
io.VcfParser
instance, the returned values are the
chromosome name, the position (0 being the first position of the
chromosome), and the number of alleles (including the reference
allele):
>>> for ret in vcf:
... print(ret)
...
('20', 14369, 2)
('20', 17329, 2)
('20', 1110695, 3)
('20', 1230236, 1)
('20', 1234566, 3)
It is also possible to iterate manually (reading variants one by one
without a for
loop) using the global function next()
:
>>> vcf.rewind()
>>> while vcf.good:
... print(next(vcf))
...
('20', 14369, 2)
('20', 17329, 2)
('20', 1110695, 3)
('20', 1230236, 1)
('20', 1234566, 3)
(rewind()
is a method to go back at the beginning
of the file.) If next(vcf)
is called again when vcf.good
is
False
, then a StopIteration
iteration is thrown (which is
the standard behaviour for the implementation of iterable types in
Python).
Importing a site¶
Data for the current site of a VcfParser
instance can be
extracted as a Site
instance using either the function
site_from_vcf()
or the instance method Site.from_vcf()
,
provided that the VCF file has called genotypes encoded using the
GT
FORMAT field:
>>> vcf = egglib.io.VcfParser('example.vcf')
>>> print(next(vcf))
('20', 14369, 2)
>>> site = egglib.site_from_vcf(vcf)
>>> print(site.as_list())
['G', 'G', 'A', 'G', 'A', 'A']
>>> print(next(vcf))
('20', 17329, 2)
>>> site.from_vcf(vcf)
>>> print(site.as_list())
['T', 'T', 'T', 'A', 'T', 'T']
Importing frequencies¶
For your information, one can extract allelic frequencies as a
Freq
instance using freq_from_vcf()
or
Freq.from_vcf()
, provided that the VCF file has frequency
information encoded using the AN
and AC
INFO fields, which is
not the case for our example file.
Getting a variant as an object¶
To extract data manually for a given site, it is also possible to get
all data at once. There is a get_variant()
method
that returns an instance of a special type (io.VcfVariant
).
This is a proxy class, just like SampleView
. Objects of the
class VcfVariant
provide a number of properties and methods
that allow to read all desired data. We will just show a single
example. The VCF file we use has a HQ
FORMAT field (haplotype
quality). We will extract it for each sample in a loop:
>>> vcf = egglib.io.VcfParser('example.vcf')
>>> for chrom, pos, nall in vcf:
... v = vcf.get_variant()
... if 'HQ' in v.format_fields:
... print([i['HQ'] for i in v.samples])
... else:
... print('no data')
...
[(51, 51), (51, 51), (None, None)]
[(58, 50), (65, 3), None]
[(23, 27), (18, 2), None]
[(56, 60), (51, 51), None]
no data
For each variant, we first tested that HQ
is present in the FORMAT
fields for this variant (in one instance, it is not the case). If so,
it is extracted from the list of dictionaries provided as the property
samples
.