Using VCF files

Short description of the format

The Variant Call Format is designed to store information relative to genome-wide diversity data. The format consists in a header containing meta-information (in lines prefixed by ##) followed by a single header providing the list of samples included in the file, and by the body of the file which consists in, typically, a very large number of lines each describing variation for a given variant (a variant can be a single nucleotide polymorphism, an insertion/deletion, a microsatellite, or any form of genomic variation, including large rearrangements.

An example appears below:

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS  ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
20  14370   rs6054257 G     A       29      PASS    NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51       1/1:43:5:.,.
20  17330   .       T       A       3       q10     NS=3;DP=11;AF=0.017     GT:GQ:DP:HQ     0|0:49:3:58,50  0|1:3:5:65,3    0/0:41:3
20  1110696 rs6040355       A       G,T     67      PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB       GT:GQ:DP:HQ     1|2:21:6:23,27  2|1:2:0:18,2    2/2:35:4
20  1230237 .       T       .       47      PASS    NS=3;DP=13;AA=T GT:GQ:DP:HQ     0|0:54:7:56,60  0|0:48:4:51,51  0/0:61:220      1234567 microsat1       GTC     G,GTCT  50      PASS    NS=3;DP=9;AA=G  GT:GQ:DP        0/1:35:4        0/2:17:2        1/1:40:3

Pieces of information are attached to each variant (site) and, within a variant, to each sample. The former are denoted INFO and the latter FORMAT. In the example above, an example of INFO field is NS (whose value is 3 for the first site), and an exemple of FORMAT field is GT (whose value for the samples of the first sites are: 0|0, 1|0, and 1|1).

A major difficulty posed by the VCF format is its very flexibility. Users have a wide liberty to define their own types of Info/Format fields, in addition to (or replacing) those who are supposed to be built-in by default in the format. There are commonly used encoding conventions, but they remain conventions there is not guarantee about how data will be represented in VCF data (we’ll be a little bit more explicit in a very short while).

There is more information in the documentation of the class io.VcfParser, including the link to the formal description of the format.

Reading VCF files

Assuming the example VCF file above has been saved in an uncompressed file named example.vcf, loading it won’t be difficult. You just need to provide the class’s constructor with the name of the file.

Actually, the constructor of the class does not read further than the header. As a result, only the meta-information present in the header and the list of samples will be known to the instance at this point. The property num_samples and the method get_sample() let you get the list of sample names:

>>> vcf = egglib.io.VcfParser('example.vcf')
>>> print([vcf.get_sample(i) for i in range(vcf.num_samples)])
['NA00001', 'NA00002', 'NA00003']

The meta-information properties attached to the file can be accessed using the same model as the sample names (one property and one getter method taking an index), as listed below for the different categories of meta-information:

Code

Type of meta-information

Counter property

Accessor method

ALT

Alternative allele code

num_alt

get_alt()

FILTER

Test used to filter files

num_filter

get_filter()

FORMAT

Descriptor of sample data

num_format

get_format()

INFO

Descriptor of variant data

num_info

get_info()

META

Other meta-information

num_meta

get_meta()

The last category, META, represents all meta-information lines with a custom key (other than ALT, FILTER, FORMAT, and INFO). To collect all user-defined META entries as a dictionary, use the following expression:

>>> print(dict([vcf.get_meta(i) for i in range(vcf.num_meta)]))
{'fileDate': '20090805', 'source': 'myImputationProgramV3.1', 'reference': 'file:///seq/references/1000GenomesPilot-NCBI36.fasta', 'contig': '<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>', 'phasing': 'partial'}

Iterating over variants

As said before, the loading of a file does not import any of the proper variant lines. Due to the potentially large size of VCF files, the VCF parser follows an iterative scheme where lines are read one after another. We we iterate over a io.VcfParser instance, we can observe that, at each step, they don’t return the full data corresponding to the current variant:

>>>   for ret in vcf:
...       print(ret)
...
('20', 14369, 2)
('20', 17329, 2)
('20', 1110695, 3)
('20', 1230236, 1)
('20', 1234566, 3)

The three items of the iterator are, in order: the chromosome name, the position (0 being the first position of the chromosome), and the number of alleles (including the reference allele).

Obviously, this system does not expose much of the actual data present on each line of the VCF file, but the aim is to avoid extracting (and converting to Python objects) the full data until explicitly requested.

At any point during iteration, the io.VcfParser instance (vcf in the example above) can be used to extract data corresponding to the current site (the one read at the last iteration loop). There are three methods, presented in the following three paragraphs: extracting site, extracting frequencies, and getting an object representing the whole variant (including all data attached to this variant).

Besides, it is also possible to iterate manually (reading variants one by one without a for loop) using the global function next():

>>> while vcf.good:
...     print(next(vcf))
...
('20', 14369, 2)
('20', 17329, 2)
('20', 1110695, 3)
('20', 1230236, 1)
('20', 1234566, 3)

If next(vcf) is called again when vcf.good is False, then a StopIteration iteration is thrown (which is the standard behaviour for the impletation of iterable types in Python).

Extracting a site from a VCF

Data for the current site of a io.VcfParser instance can be extracted as a Site instance using either site_from_vcf() or the from_vcf() method of Site instances, provided that the VCF file has called genotypes encoded using the GT FORMAT field:

>>> vcf = egglib.io.VcfParser('example.vcf')
>>> print(vcf.next())
('20', 14369, 2)
>>> site = egglib.site_from_vcf(vcf)
>>> print(site.as_list())
['G', 'G', 'A', 'G', 'A', 'A']

>>> print(vcf.next())
('20', 17329, 2)
>>> print(vcf.next())
('20', 1110695, 3)
>>> print(vcf.next())
('20', 1230236, 1)
>>> print(vcf.next())
('20', 1234566, 3)
>>> site.from_vcf(vcf)
>>> print site.as_list()
['G', 'T', 'T', 'G', 'T', 'T']

Extracting frequencies from a VCF

Similarly, one can extract allelic frequencies as a stats.Freq instance using stats.freq_from_vcf() or the from_vcf(), method of stats.Freq instances, provided that the VCF file has frequency information encoded using the AN and AC INFO fields:

>>> vcf = egglib.io.VcfParser('example.vcf')
>>> print(vcf.next())
('20', 14369, 2)
>>> frq = egglib.stats.freq_from_vcf(vcf)
>>> print(frq.freq_allele(0), frq.freq_allele(1))
3 3

>>> print(vcf.next())
('20', 17329, 2)
>>> print(vcf.next())
('20', 1110695, 3)
>>> print(vcf.next())
('20', 1230236, 1)
>>> print(vcf.next())
('20', 1234566, 3)
>>> frq.from_vcf(vcf)
>>> print(frq.freq_allele(0), frq.freq_allele(1), frq.freq_allele(2))
2 3 1

We can see that Site from VCF extraction and stats.Freq from VCF extraction are consistent.

Getting a variant as an object

To extract data manually for a given site, it is also possible to get all data at once. There is a get_variant() method that returns an instance of a special type (io.Variant). This is a proxy class, just like SampleView for Align classes, and the same precautions must be taken while using it. Objects of the class Variant provide a number of properties and methods that allow to read all desired data. We will just show a single example. The VCF file we use has a HQ FORMAT field (haplotype quality). We will extract it for each sample in a loop:

>>> vcf = egglib.io.VcfParser('example.vcf')
>>> for chrom, pos, nall in vcf:
...     v = vcf.get_variant()
...     if 'HQ' in v.format_fields:
...         print([i['HQ'] for i in v.samples])
...     else:
...         print('no data')
...
[(51, 51), (51, 51), (None, None)]
[(58, 50), (65, 3), None]
[(23, 27), (18, 2), None]
[(56, 60), (51, 51), None]
no data

For each variant, we first tested that HQ is present in the FORMAT fields for this variant (in one instance, it is not the case). If so, it is extracted from the list of dictionaries provided as the property samples.

Reading compressed files or arbitrary data

The class io.VcfParser is not able to read compressed files, requiring that you decompress them before you read them. This can be done “on the fly” using embedded facilities of the Python language.

Let’s first contemplate the problem, where example.vcf.gz is a gunzip-compressed version of the same example VCF file:

>>> vcf = egglib.io.VcfParser('example.vcf.gz')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/.local/lib/python3.8/site-packages/egglib/io/_vcf.py", line 356, in __init__
    self._parser.open_file(fname)
OSError: cannot parse "VCF" data from "example.vcf.gz" (line 1): first character of file is not "#" as expected [char: ]

A strange character (irrelevant by itself) should be shown at the end of the error message. This strange character is a direct consequence of providing compressed data when the parser expects uncompressed data. The solution, besides merely uncompressing the file and importing the resulting uncompressed files, is shown below:

>>> import gzip
>>> f = gzip.open('example.vcf.gz','rt')
>>> cache = []
>>> while True:
...     line = f.readline()
...     if line[:2] == '##': cache.append(line)
...     elif line[:1] == '#':
...         cache.append(line)
...         break
...     else: raise IOError('invalid file')
...
>>> header = ''.join(cache)
>>> vcf = egglib.io.VcfStringParser(header)
>>> site = egglib.Site()
>>> for line in f:
...     print(vcf.readline(line))
...     site.fom_vcf(vcf)
...     print(site.as_list())
...
('20', 14369, 2)
['G', 'G', 'A', 'G', 'A', 'A']
('20', 17329, 2)
['T', 'T', 'T', 'A', 'T', 'T']
('20', 1110695, 3)
['G', 'T', 'T', 'G', 'T', 'T']
('20', 1230236, 1)
['T', 'T', 'T', 'T', 'T', 'T']
('20', 1234566, 3)
['GTC', 'G', 'GTC', 'GTCT', 'G', 'G']

The gzip module lets you read gunzip–compressed data using a file-like interface. The only tricky part is to import the header (all lines starting with ## up to the one starting with a single #). The header line are stored in the cache list then concatenated back into a str.

io.VcfStringParser allows to process VCF data from user-provided strings. The constructor expects the full header as a single string, and a specific method readline() expects each variant line as a string.

This class useful in cases such as this—where the data are imported from a non-trivial source—or in cases where the user wants to generate a header dynamically. For the rest, this class offers the same functionality than io.VcfParser.

Each line can be provided as a string through the method readline(), which is identical to calling next() on io.VcfParser instances, except that the data are provided from the outside. Of course, it is required that the number of samples is constant throughout the data which are provided, and that all INFO or FORMAT fields are declared and properly used, as for any normal file.