.. _manual-vcf: --------------- Using VCF files --------------- Short description of the format =============================== The Variant Call Format is designed to store information relative to genome-wide diversity data. The format consists in a header containing meta-information (in lines prefixed by ``##``) followed by a single header providing the list of samples included in the file, and by the body of the file which consists in, typically, a very large number of lines each describing variation for a given *variant* (a variant can be a single nucleotide polymorphism, an insertion/deletion, a microsatellite, or any form of genomic variation, including large rearrangements. An example appears below: .. code-block:: none ##fileformat=VCFv4.4 ##fileDate=20090805 ##source=myImputationProgramV3.1 ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta ##contig= ##phasing=partial ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##FILTER= ##FILTER= ##FORMAT= ##FORMAT= ##FORMAT= ##FORMAT= #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,. 20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4 20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2 20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3 Pieces of information are attached to each variant (site) and, within a variant, to each sample. The former are denoted ``INFO`` and the latter ``FORMAT``. In the example above, an example of ``INFO`` field is ``NS`` (whose value is 3 for the first site), and an exemple of ``FORMAT`` field is ``GT`` (whose value for the samples of the first sites are: ``0|0``, ``1|0``, and ``1|1``). The description of the VCF format is available `here `_. Reading VCF files ================= EggLib provides two alternative parsers in the :ref:`io ` module: :class:`~.io.VcfParser` and :class:`~.io.VCF`. The former is essentially there as a fallback solution in case the, latter, which depends on the C library ``htslib``, is not available. Refer to the installation :ref:`install` for installation. If the dependency is fulfilled at installation, the :class:`!VCF` class will be available. If not, attempting to use it will cause an exception. It can be tested using a variable exposed in EggLib:: >>> print(egglib.config.htslib) 1 (This will return 0 or 1.). It can be also tested from the command line: .. code-block:: none :emphasize-lines: 5 $ EggLib version 3.2.0 Installation path: /home/stephane/.local/lib/python3.9/site-packages/egglib/wrappers External application configuration file: /home/stephane/.config/EggLib/apps.conf debug flag: 0 htslib flag: 1 version of muscle: 5 The :class:`!VCF` class offers a number of advantages: * It is based on htslib, the underlying library of the ``samtools`` and ``bcftools`` programs, making it the *de facto* standard for parsing VCF/BCF files. :class:`!VcfParser` is based on a native implementation which can differ occasionally (often by being more restrictive and complaining about the format). * It can import both compressed and uncompressed VCF and BCF files. With :class:`!VcfParser`, the user is required to provide uncompressed VCF file, which can be a huge bottleneck. * It is expected to be significantly more efficient, especially for direct reading of BCF data. Using default parser :class:`!VCF` ================================== Opening a file -------------- To open a file with the :class:`~.io.VCF` class, pass the name of a compressed or uncompressed VCF or BCF file as in:: >>> vcf = egglib.io.VCF('example.vcf') >>> print(vcf.get_samples()) ['NA00001', 'NA00002', 'NA00003'] Immediately after opening the file, no data has been accessed; all accessors will return ``None`` (except header data):: >>> print(vcf.get_pos()) None Iterating on positions ---------------------- The next position (or variant) is read using the :meth:`~.io.VCF.read` method, which returns a boolean. If the boolean if ``True``, data has been read and can be accessed. If (and only if) the end of file is reached, :meth:`!read` returns ``False``. To loop over the whole content of the file, just write:: >>> while vcf.read(): ... print(vcf.get_chrom(), vcf.get_pos()) ... 20 14369 20 17329 20 1110695 20 1230236 20 1234566 Indexing -------- Indexing allows arbitrary and linear-time navigation within BCF files. (not available for VCF files). Index files generated by ``bcftools`` are supported, while the function :func:`.io.index_vcf` can be used to generate a BCF index. To demonstrate the use of indexes, we will use a BCF file which we will index before importing it:: >>> egglib.io.index_vcf('data.bcf') >>> vcf = egglib.io.VCF('data.bcf') >>> print(vcf.has_index) True The index file is named after the BCF file (with a ".csi" suffix). By default, :func:`!index_vcf` and :class:`!VCF` use the same format. If the index is named differently (e.g. located in a different directory), its name can be specified as the *index* option of the :class:`!VCF` constructor:: >>> egglib.io.index_vcf('data.bcf', outname='another_name.csi') >>> vcf = egglib.io.VCF('data.bcf', index='another_name.csi') >>> print(vcf.has_index) True Navigation with :meth:`!goto` ----------------------------- Provided that an index is loaded, the :meth:`~.io.VCF.goto` method allows to move anywhere in the file. To demonstrate the use of :meth:`!goto`, consider the table of positions actually available in ``data.bcf``: +------+------+ | ctg1 | 1000 | +------+------+ | ctg1 | 1001 | +------+------+ | ctg1 | 1010 | +------+------+ | ctg1 | 1011 | +------+------+ | ctg2 | 1015 | +------+------+ | ctg2 | 1016 | +------+------+ | ctg2 | 1020 | +------+------+ | ctg2 | 1030 | +------+------+ | ctg2 | 1050 | +------+------+ | ctg3 | 1060 | +------+------+ | ctg3 | 1100 | +------+------+ When we open the file, reading it will extract the first line:: >>> egglib.io.index_vcf('data.bcf') >>> vcf.read() >>> print(vcf.get_chrom(), vcf.get_pos()) ctg1 999 Remember about the 1-position offset caused by the conversion of genomic positions to pythonic indexes applied by EggLib. :meth:`!goto` can be used to navigate, either back or forth, within the same contig:: >>> vcf.goto('ctg2', 1019) >>> print(vcf.get_chrom(), vcf.get_pos()) ctg2 1019 >>> vcf.goto('ctg1', 1009) >>> print(vcf.get_chrom(), vcf.get_pos()) ctg1 1009 If the position is omitted, the first available position of the contig is picked: >>> vcf.goto('ctg2') >>> print(vcf.get_chrom(), vcf.get_pos()) ctg2 1014 If the contig does not exist, the move fails and an :class:`ValueError` is raised:: >>> vcf.goto('ctg4') Traceback (most recent call last): File "/home/stephane/data/devel/egglib/project/doc/manual/test/stats-4.py", line 78, in vcf.goto('ctg4', 1000) ValueError: unknown target name: ctg4 And also if the position does not exist in the specified contig:: >>> vcf.goto('ctg3', 1000) Traceback (most recent call last): File "/home/stephane/data/devel/egglib/project/doc/manual/test/stats-4.py", line 80, in vcf.goto('ctg3', 1000) ValueError: position not found: 1000 However, it is possible to let :meth:`!goto` move to the first available position within a specified range by using the option *limit*: >>> vcf.goto('ctg3', 1000, limit=1100) >>> print(vcf.get_chrom(), vcf.get_pos()) ctg3 1059 Extracting data --------------- There a number of accessors allowing to extract data from the current position or variant. To get the dictionary of all ``INFO`` fields attached to the current position, one can use :meth:`~.io.VCF.get_infos`, and :meth:`~.io.VCF.get_info` to get a specific field:: >>> print(vcf.get_infos()) {'AA': 'A', 'TRI': [1.0, 2.0], 'ALT': 'C,G,T', 'GOOD': True} >>> print(vcf.get_info('AA')) A To get the values of all ``FORMAT`` fields for all samples, the method :meth:`~.io.VCF.get_formats` can be used. It returns a :class:`!list` (one item per sample) of :class:`!dict` which all share the same set of keys. The following gives an example which might betray the lack of imagination of the author of the test file:: >>> print(vcf.get_formats()) [{'TEST5': '.', 'TEST1': 702}, {'TEST5': 'nothing', 'TEST1': 703}, {'TEST5': 'not more', 'TEST1': 704}, {'TEST5': 'something!', 'TEST1': 705}] Another crucial accessor method is :meth:`~.io.VCF.get_genotypes`, which returns a :class:`!list` of genotypes. In this :class:`!list`, each sample is represented by a :class:`!list` of alleles, based on its ploidy. To transfer this structure to an EggLib's :class:`.Site`, one must flatten the list and subsequently generate a :class:`.Structure` object to analyse the object with its ploidy with the :ref:`stats ` module:: >>> print(vcf.is_snp()) True >>> genotypes = vcf.get_genotypes() [['A', 'A', 'A'], ['A', 'A', 'A'], ['A', 'A', 'C'], ['A', 'C', 'C']] >>> print(genotypes) >>> site = egglib.site_from_list([j for i in genotypes for j in i], ... alphabet = egglib.alphabets.DNA) >>> struct = egglib.struct_from_samplesizes([4], ploidy=3) The code uses the :meth:`~.io.VCF.is_snp` method to check if the current site is a proper SNP, guaranteeing that the :obj:`~.alphabets.DNA` can be used. Next the site is extracted and converted to a :class:`!Site` object. The list comprehension with two ``for`` statements (``[j for i in genotypes for j in i]``) is the way to flatten a sequence is Python. The last line is a reminder how a :class:`!Structure` object with known sample size and known (and constant) ploidy can be created. Using the fallback parser :class:`!VcfParser` ============================================= Opening a file -------------- Assuming the example VCF file above has been saved in an uncompressed file named ``example.vcf``, you need to provide the class's constructor with the name of the file. As a result, only the meta-information present in the header and the list of samples will be known to the instance at this point. The property :py:obj:`~.io.VcfParser.num_samples` and the method :meth:`~.io.VcfParser.get_sample` let you get the list of sample names:: >>> vcf = egglib.io.VcfParser('example.vcf') >>> print([vcf.get_sample(i) for i in range(vcf.num_samples)]) ['NA00001', 'NA00002', 'NA00003'] The meta-information properties attached to the file can be accessed using the same model as the sample names (one property and one getter method taking an index), as listed below for the different categories of meta-information: +---------------+----------------------------+-------------------------------------+-----------------------------------+ | Code | Type of meta-information | Counter property | Accessor method | +===============+============================+=====================================+===================================+ | ``ALT`` | Alternative allele code | :py:obj:`~.io.VcfParser.num_alt` | :meth:`~.io.VcfParser.get_alt` | +---------------+----------------------------+-------------------------------------+-----------------------------------+ | ``FILTER`` | Test used to filter files | :py:obj:`~.io.VcfParser.num_filter` | :meth:`~.io.VcfParser.get_filter` | +---------------+----------------------------+-------------------------------------+-----------------------------------+ | ``FORMAT`` | Descriptor of sample data | :py:obj:`~.io.VcfParser.num_format` | :meth:`~.io.VcfParser.get_format` | +---------------+----------------------------+-------------------------------------+-----------------------------------+ | ``INFO`` | Descriptor of variant data | :py:obj:`~.io.VcfParser.num_info` | :meth:`~.io.VcfParser.get_info` | +---------------+----------------------------+-------------------------------------+-----------------------------------+ | ``META`` | Other meta-information | :py:obj:`~.io.VcfParser.num_meta` | :meth:`~.io.VcfParser.get_meta` | +---------------+----------------------------+-------------------------------------+-----------------------------------+ The last category, ``META``, represents all meta-information lines with a custom key (other than ``ALT``, ``FILTER``, ``FORMAT``, and ``INFO``). To collect all user-defined ``META`` entries as a dictionary, use the following expression:: >>> meta = dict([vcf.get_meta(i) for i in range(vcf.num_meta)]) >>> print(meta) {'fileDate': '20090805', 'source': 'myImputationProgramV3.1', 'reference': 'file:///seq/referen ces/1000GenomesPilot-NCBI36.fasta', 'contig': '', 'phasing': 'partial'} Reading variants ---------------- Due to the potentially large size of VCF files, the VCF parser follows an iterative scheme where lines are read one after another, only keeping the current one in memory. When iterating over a :class:`.io.VcfParser` instance, the returned values are the chromosome name, the position (0 being the first position of the chromosome), and the number of alleles (including the reference allele):: >>> for ret in vcf: ... print(ret) ... ('20', 14369, 2) ('20', 17329, 2) ('20', 1110695, 3) ('20', 1230236, 1) ('20', 1234566, 3) It is also possible to iterate manually (reading variants one by one without a ``for`` loop) using the global function :func:`next`:: >>> vcf.rewind() >>> while vcf.good: ... print(next(vcf)) ... ('20', 14369, 2) ('20', 17329, 2) ('20', 1110695, 3) ('20', 1230236, 1) ('20', 1234566, 3) (:meth:`~.io.VcfParser.rewind` is a method to go back at the beginning of the file.) If ``next(vcf)`` is called again when ``vcf.good`` is ``False``, then a :exc:`StopIteration` iteration is thrown (which is the standard behaviour for the implementation of iterable types in Python). Importing a site ---------------- Data for the current site of a :class:`!VcfParser` instance can be extracted as a :class:`.Site` instance using either the function :func:`.site_from_vcf` or the instance method :meth:`.Site.from_vcf`, provided that the VCF file has called genotypes encoded using the ``GT`` FORMAT field:: >>> vcf = egglib.io.VcfParser('example.vcf') >>> print(next(vcf)) ('20', 14369, 2) >>> site = egglib.site_from_vcf(vcf) >>> print(site.as_list()) ['G', 'G', 'A', 'G', 'A', 'A'] >>> print(next(vcf)) ('20', 17329, 2) >>> site.from_vcf(vcf) >>> print(site.as_list()) ['T', 'T', 'T', 'A', 'T', 'T'] Importing frequencies --------------------- For your information, one can extract allelic frequencies as a :class:`.Freq` instance using :func:`.freq_from_vcf` or :meth:`.Freq.from_vcf`, provided that the VCF file has frequency information encoded using the ``AN`` and ``AC`` INFO fields, which is not the case for our example file. Getting a variant as an object ------------------------------ To extract data manually for a given site, it is also possible to get all data at once. There is a :meth:`~.io.VcfParser.get_variant` method that returns an instance of a special type (:class:`.io.VcfVariant`). This is a proxy class, just like :class:`.SampleView`. Objects of the class :class:`!VcfVariant` provide a number of properties and methods that allow to read all desired data. We will just show a single example. The VCF file we use has a ``HQ`` FORMAT field (haplotype quality). We will extract it for each sample in a loop:: >>> vcf = egglib.io.VcfParser('example.vcf') >>> for chrom, pos, nall in vcf: ... v = vcf.get_variant() ... if 'HQ' in v.format_fields: ... print([i['HQ'] for i in v.samples]) ... else: ... print('no data') ... [(51, 51), (51, 51), (None, None)] [(58, 50), (65, 3), None] [(23, 27), (18, 2), None] [(56, 60), (51, 51), None] no data For each variant, we first tested that ``HQ`` is present in the FORMAT fields for this variant (in one instance, it is not the case). If so, it is extracted from the list of dictionaries provided as the property :py:attr:`~.io.VcfVariant.samples`.