.. _manual_compute_stats:

---------------------------------------
The :class:`!stats.ComputeStats` class
---------------------------------------

The :ref:`stats <stats>` module contains a set of tools to analyze molecular
variation data, either represented as :class:`.Align` instances (which
have been extensively described in the previous part of the manual) or
:class:`.Site` instances (which hold information for a single locus).

Among those tools, :class:`.stats.ComputeStats` is the more complex, 
and the more rich with functionalities. Most of the statistics can be 
computed using this class (in general, only the most complex statistics 
deserve their own function or class, such as part of linkage 
disequilibrium statistics, which are addressed near the end of this 
page).

The rationale for defining :class:`!ComputeStats` as a class rather than a
standalone function is that it allows more flexibility (the following
sections offer an overview of the possibilities) and that it can also be
more efficient if a lot of items (alignments or sites) need to be processed
in a row.

The basic worflow consists in the following steps, which are developped in
the following paragraphs:

#. Create an instance of the class,
#. Set parameters,
#. Enter the list of statistics to be computed,
#. Process one or several alignments or sites.

Options
=======

It is possible to pass parameters values either as arguments of 
:class:`.stats.ComputeStats` instances, or through their 
:meth:`~.ComputeStats.configure` method. The parameters are listed and 
explained in the documentation of this method. The most important is 
*multi_hits*. By default, *multi_hits* is ``False``, causing all sites 
that have more than two alleles to be excluded as well. The default 
parameters might be what you want if you use single-nucleotide 
polymorphism markers or nucleotide sequence alignments, but certainly 
not if you use massively multiallelic markers.

To pass parameters, both syntaxes are equivalent::

    >>> cs = egglib.stats.ComputeStats()
    >>> cs.configure(multi_hits=True)

and ::

    >>> cs = egglib.stats.ComputeStats(multi_hits=True)

Note also that it is also possible to call :meth:`!configure` later at
any moment in the workflow. In general, it is recommended to reuse as
much as possible objects whenever you perform intensive computations.

List of statistics
==================

Statistics can be specified using the instance method :meth:`~.ComputeStats.add_stats`.
It is possible to call this method several times in a row, it can be
called at any time even after analysing data sets, and it is also possible
to enter several statistics at once, as shown in the example below::

    >>> cs.add_stats('S', 'Pi')
    >>> cs.add_stats('D')

Statistics are represented by a code (as a :class:`str`). A list of
tuples matching each code with a short description can be obtained through
the instance method :meth:`~.ComputeStats.list_stats`. A exhaustive notice
is availabled in :ref:`stats-notice`. ::

    >>> for stat, descr in cs.list_stats():
    ...     print(stat + ': ' + descr)
    ns_site: Number of analyzed samples per site
    ns_site_o: Number of analyzed outgroup samples per site
    Aing: Number of alleles in ingroup
    Aotg: Number of alleles in outgroup
    Atot: Number of alleles in whole dataset
    As: Number of singleton alleles
    Asd: Number of singleton alleles (derived)
    R: Allelic richness
    thetaIAM: Theta estimator based on He & IAM model
    thetaSMM: Theta estimator based on He & SMM model
    [...skipped...]

There is a method to add all possible statistics (:meth:`~.ComputeStats.all_stats`),
but note that, obviously, only those that can be computed (depending on
the structure of the provided data set) will be computed (others will be
set to ``None``).

.. note::

    Even if a statistic can be computed, it does not mean that it should.
    When analysing data, it can be counter-productive to generate huge
    tables of statistics among which some (many) may have too much variance
    to be interpreted anyway.

You can reset the list of statistics using::

    >>> cs.clear_stats()

If you are happy with default values of parameters, you are not 
required to call :meth:`~.stats.ComputeStats.configure`. However, it is 
necessary that you enter parameters to be computed, so you have to call 
:meth:`~.stats.ComputeStats.add_stats`. Still, it is not an error to 
skip this step and proceed with analysing data (but in that case you 
will not get any results).

Processing data
===============

Once needed parameters have been set and needed statistics have been
entered, the user can analyze data using the following methods of
the :class:`!ComputeStats` instance:

    * :meth:`~.ComputeStats.process_align` to process an :class:`.Align` instance.
    * :meth:`~.ComputeStats.process_site` to process a :class:`.Site` instance.
    * :meth:`~.ComputeStats.process_sites` to process several :class:`.Site` instances.
    * :meth:`~.ComputeStats.process_freq` to process a :class:`.Freq` instance (see below).

These methods are addressed in more details in the following section.
By default, they all return a :class:`dict` containing computed values
of statistics corresponding to the analysis of data provided as argument.
It is also possible to call them several times and access the results of
the analysis of all passed data eventually (see :ref:`stats-multiple`).

--------------------
Computing statistics
--------------------

Single alignment
================

Computing statistics from a single alignment should be rather 
straightforward. Assume we have a Fasta file containing a nucleotide 
alignment. We show below how we would compute a set of standard 
statistics used in the case of nucleotide sequences from this 
alignment::

    >>> aln1 = egglib.io.from_fasta('align1.fas', labels=True, alphabet = egglib.alphabets.DNA)
    >>> struct = egglib.struct_from_labels(aln1, lvl_pop=0, lvl_indiv=1)
    >>> cs = egglib.stats.ComputeStats()
    >>> cs.set_structure(struct)
    >>> cs.add_stats('S', 'thetaW', 'Pi', 'D', 'lseff', 'nseff')
    >>> stats = cs.process_align(aln1)
    >>> print(aln1.ns, aln1.ls)
    101 8942
    >>> print(stats)
    {'lseff': 3288, 'S': 305, 'thetaW': 59.02529109000289, 'nseff': 99.0, 'Pi': 67.56256441970726, 'D': 0.48870487478046226}

Note that in order to exclude the outgroup sequences from the analysis, 
we have to import labels and generate an appropriate :class:`.Structure`
instance using the method :func:`.struct_from_labels()`. The management of
structures is described in the next section. This structure is then imported
in the :class:`!ComputeStats` instance using the method :meth:`~.ComputeStats.set_structure()`.
As visible in the statistics list, the number of available statistics is
much larger than that, but many will be irrelevant for this kind of data
(most of them being reported as ``None`` because they just can't be computed).
We first printed the :py:obj:`~.Align.ns` (number of provided samples) and
:py:obj:`~.Align.ls` (alignment length) properties of the alignment: there are 101
samples and 8942 sites.

    * ``S`` is the number of polymorphic sites. Here we have 305 variable sites.
    * ``thetaW`` (:math:`\hat{\theta}_W`) is the so-called Watterson's estimator of
      :math:`\theta = 4N_e\mu`. The value is close to 60 which approaches
      0.09 per analysed site (see below).
    * ``Pi`` (:math:`\pi`) is the nucleotide diversity, which is a bit
      larger.
    * ``D`` (Tajima's :math:`D`) is positive (+0.49), which is a corollary of
      ``Pi`` being larger than ``thetaW``.
    * ``lseff`` is the number of sites used for analysis (excluding those
      with either too many missing data or too many alleles). Here the
      number is 3288, meaning that 5654 sites have been excluded (mostly because
      the default is to exclude all sites with any missing data).
    * ``nseff`` is the average number of used samples among included sites.
      In our case, since only sites with no missing data at all have been
      used, the number of samples is 99 because the last two sequences are 
      from the outgroup, but this value can be smaller if the parameter 
      *max_missing* is larger than 0 (see example in :ref:`manual-missing-data`).

.. note::

    ``thetaW`` and ``Pi`` are given per gene (that is, they are
    summed over the total number of analyzed sites). To be expressed
    per site, they must be divided by the number of sites available for
    analysis. If all sites have been provided, this value is given by
    the statistic ``lseff``. But ``lseff`` is relevant if only
    polymorphic or pre-filtered sites have been provided.

.. _stats-multiple:

Multiple alignments
===================

If you have several separate alignments but you want to obtain global
statistics, the instinctive approach would be to concatenate the alignments
and use the result in :meth:`!process_align`. There is a much more efficient
way (assuming you want to do so with many alignments), described below.

There is a :fparam:`multi` option which, if toggled, set :class:`!ComputeStats`
to load data in several batches, perform intermediate computations but
delay finalisation of statistics computations until all data have been
loaded. In that case, you must call :meth:`~.ComputeStats.results` to
actually compute and get statistics::

    >>> alnA = aln1.extract(0, 4500)
    >>> alnB = aln1.extract(4500, None)
    >>> cs.configure(multi=True, struct=struct)
    >>> cs.process_align(alnA)
    >>> cs.process_align(alnB)
    >>> stats = cs.results()
    >>> print(stats)
    {'thetaW': 59.02529109000289, 'D': 0.48870487478046226, 'nseff': 99.0, 'lseff': 3288, 'Pi': 67.56256441970726, 'S': 305}

The above example just cut the original alignment in two parts, and then
processes the two alignments separately. The final dictionary of statistics,
as expected, is exactly identical to the one obtained with the full
alignment (n.b. even if it was specified before, it is necessary to pass
again the structure, because the call to :meth:`!configure` resets all
parameters that are not specified).

Phased data
***********

Note that some statistics require that data are phased. That's the case
of all statistics based on haplotype and linkage disequilibrium. If multiple
alignments are loaded, it is not ensured that data are phased, even if the number
of samples matches. Make sure that your data are REALLY phased if you want to compute
these statistics. By default EggLib will calculate them if the data allow it.
Note also that all statistics requiring phase entail significantly longer
computations when applied to large number of polymorphic sites. These stastistics
are also hardly relevant over large genomic regions.

Using individuals sites
=======================

Individual sites can be processed as well. They are represented by the
class :class:`.Site`. This class is aimed to represent data for any
genetic marker such as a single nucleotide polymorphism (SNP), a
microsatellite, an encoded insertion/deletion polymorphism, or any other
form of genetic variation properly encoded using an appropriate :class:`.Alphabet` instance.
The following functions allow to create a :class:`!Site`:

+--------------------------+-----------------------------------------------------------+ 
| Generator functions      | Source of data                                            |
+==========================+===========================================================+
| :func:`.site_from_align` | A position in an :class:`.Align` instance.                |
+--------------------------+-----------------------------------------------------------+ 
| :func:`.site_from_list`  | A user-provided :class:`list` of data.                    |
+--------------------------+-----------------------------------------------------------+ 
| :func:`.site_from_vcf`   | The current data of a VCF parser (see :ref:`manual-vcf`). |
+--------------------------+-----------------------------------------------------------+ 

Note that each of those functions has a counterpart as a :class:`!Site`
method to allow to recycle an existing object.

In the examples of the following paragraphs, we will create sites from
the same alignment that we have been using in previous examples. However,
in practice the class :class:`!Site` is mostly there for cases when
individual sites are available. The small example below shows how to
create a :class:`!Site` from a list of allelic values, which is the most
simple and intuitive way::

    >>> site = egglib.site_from_list(
    ...     ['C', 'G', 'G', 'C', 'T', 'T', 'G', 'T', 'G', 'G', 'G', 'G'],
    ...     alphabet=egglib.alphabets.DNA)

.. note::

    When analysing individual sites, it is frequent that more than two
    alleles are present (except for canonical SNP sites). It can also
    happen with sequence alignments. If this is the case and you wish to
    include such sites in the analysis, don't forget to set the
    :class:`!ComputeStats` parameter *multi_hits* to ``True``.


Single site statistics
**********************

The code in the next example will clear the list of statistics and 
specify a list more adapted to single-site analysis, and then will 
analyse the site at position 66 (which is the 67\ :sup:`th` site)::

    >>> cs.clear_stats()
    >>> cs.configure(multi=False, struct=struct)
    >>> cs.add_stats('Aing', 'He', 'R')
    >>> site = egglib.site_from_align(aln1, 66)
    >>> stats = cs.process_site(site)
    >>> print(stats)
    {'Aing': 3, 'R': 0.02040816326530612, 'He': 0.6233766233766234}

The statistics computed here are:

    * ``Aing``: the number of alleles in ingroup (this is a relatively unfrequent case
      with a SNP with three alleles within a nucleotide alignment).
    * ``R``: the allelic richness.
    * ``He``: the heterozygosity (which is above 0.5 only because there are three alleles).

Multiple sites statistics
*************************

You might be also interested in statistics over several sites. You can
load multiple sites in a similar way as for alignments. In the example
below we compute, in addition to ``Aing``, ``R`` and ``He``, the
per-gene statistic ``D`` over all sites of the alignment, but we keep
on computing statistics on a per-site basis::

    >>> cs.add_stats('D', 'Pi')
    >>> site = egglib.Site()
    >>> for i in range(aln1.ls):
    ...     site.from_align(aln1,i)
    ...     stats = cs.process_site(site)
    ...     print(stats)
    {'R': 0.0, 'He': 0.0, 'Pi': None, 'D': None, 'Aing': 1}
    {'R': 0.0, 'He': 0.0, 'Pi': None, 'D': None, 'Aing': 1}
    [...skipped...]
    {'R': 0.010416666666666666, 'Pi': None, 'D': None, 'He': 0.3762886597938144, 'Aing': 2}
    {'R': 0.01020408163265306, 'Pi': None, 'D': None, 'He': 0.37105751391465686, 'Aing': 2}
    {'R': 0.010309278350515464, 'Pi': None, 'D': None, 'He': 0.3736587418472543, 'Aing': 2}
    [...skipped...]

In this example, we recycled a unique :class:`!Site` instance using its
method :meth:`~.Site.from_align` instead of creating a new instance at
each iteration step, which is a good practice for performance reasons in
this precise sitution. Note that there is also a
:meth:`~.ComputeStats.process_sites` method to process a list of sites
in one call. :meth:`!process_sites` is more appropriate to process
dynamically generated arrays of sites as in this case.

``Pi`` and ``D`` are not computed because they are not defined for
individual sites. What we really want, in this example, is to replicate
the analysis performed with :meth:`!process_align` to demonstrate how it
would be done if we had only individual sites in the first place. This
can be achieved by means of the *multi* argument:

    >>> cs.configure(multi=False, struct=struct)
    >>> for i in range(aln1.ls):
    ...     site.from_align(aln1,i)
    ...     stats = cs.process_site(site)
    >>> print(cs.results())
    {'He': 0.02184674638703229, 'Aing': 1.098813786929275, 'R': 0.0010210126233630454, 'Pi': 185.90085077948126, 'D': 0.47716739584524404}

Static list of sites
********************

Linkage disequilibrium statistics, as well as :math:`\bar{r}_d` (code 
``rD``), require that all sites used for analysis are available at the 
time of final computation. Those statistics cannot be computed by the 
method :meth:`~.stats.ComputeStats.results` of 
:class:`!ComputeStats` if the data have been provided by 
:meth:`!process_site` or :meth:`!process_sites`. The examples below 
demonstrates it with the case of the ``ZnS`` statistic. Such statistic 
can be computed if we pass an :class:`!Align`::

    >>> cs.clear_stats()
    >>> cs.add_stats('ZnS')
    >>> print(cs.process_align(aln1))
    {'ZnS': 0.17236275214525582}

They can be as well if we pass several fragments, but only if they have
the same number of samples (it is implied that the list of samples is
matching, that is that data are phased)::

    >>> alnA = aln1.extract(0, 4500)
    >>> alnB = aln1.extract(4500, None)
    >>> cs.configure(multi=True)
    >>> cs.set_structure(struct)
    >>> cs.process_align(alnA)
    >>> cs.process_align(alnB)
    >>> print(cs.results())
    {'ZnS': 0.17236275214525582}

To test what happens when we provide the sites individually, we extract 
all sites of the alignment. However, :meth:`~.stats.ComputeStats.process_align`
excludes by default all sites with any missing data in the ingroup while
:meth:`!process_site` considers all sites which are provided. So we need
to filter the sites ourselves, to only include those that have no
missing data, hence 99 available samples, using the class :class:`.Freq`::

    >>> sites = []
    >>> frq = egglib.Freq()
    >>> for i in range(aln1.ls):
    ...     site = egglib.site_from_align(aln1, i)
    ...     frq.from_site(site, struct)
    ...     if frq.nseff(frq.ingroup) == 99:
    ...         sites.append(site)

The statistics in question are not computed if we provide the sites
individually and one by one, regardless of whether they are phased or not,
because nothing guarantees that the site objects will be constant until
statistics are actually computed::

    >>> for site in sites:
    ...     cs.process_site(site)
    >>> print(cs.results())
    {'ZnS': None}

In that case, it is necessary to pass all sites grouped together in one
list::

    >>> print(cs.process_sites(sites))
    {'ZnS': 0.17236275214525582}

Stand-alone allelic frequencies
===============================

It can happen that only allelic frequencies are available (such as with
bulk sequencing). In that case :class:`!Site` is not appropriate because
it requires an ordering of samples. It would be possible to arbitrarily
create a site from a set of allelic frequencies but this would be a
pointless waste of computing resources. There is a class in EggLib, named
:class:`.Freq`, addressing this case.

There is a logical relationship between the classes :class:`!Align`,
:class:`!Site`, and :class:`!Freq`: the latter two can be created based on
instances of the respective previous one, but they can also be provided by
external means. In other words, :class:`!Align` and even :class:`!Site` can
be bypassed if the corresponding data are not available. We have seen in the
previous section that it is not necessary to artifically create an
:class:`!Align` instance if only available sites are available. Likewise,
:class:`!Site` can be bypassed if only frequencies are available.

Like :class:`!Site`, :class:`!Freq` instances can be created from different types
of source: from a :class:`!Site` instance, from a user-specified list, or from VCF data.

+-------------------------+-----------------------------------------------------------+ 
| Generator functions     | Source of data                                            |
+=========================+===========================================================+
| :func:`.freq_from_site` | A :class:`!Site` instance.                                |
+-------------------------+-----------------------------------------------------------+ 
| :func:`.freq_from_list` | A user-provided :class:`list` of data.                    |
+-------------------------+-----------------------------------------------------------+ 
| :func:`.freq_from_vcf`  | The current data of a VCF parser (see :ref:`manual-vcf`). |
+-------------------------+-----------------------------------------------------------+ 

There are also equivalent methods to recycle an existing instance.

Creation of :class:`!Freq` instances from a site
************************************************

If we go back to the example site created above, we see that creating a
:class:`.Freq` instance is rather straighforward (see highlighted line below):

.. code-block:: python
   :emphasize-lines: 4

   >>> site = egglib.site_from_list(
   ...     ['C', 'G', 'G', 'C', 'T', 'T', 'G', 'T', 'G', 'G', 'G', 'G'],
   ...     alphabet=egglib.alphabets.DNA)
   >>> freq = egglib.freq_from_site(site)
   >>> for i in range(freq.num_alleles):
   ...     print(freq.allele(i), freq.freq_allele(i))
   C 2
   G 7
   T 3

Creation of :class:`!Freq` instances from user-provided frequency data
**********************************************************************

To create a :class:`.Freq` instance from already computed allelic frequencies,
the syntax is logical but the data format must be followed carefully.
Here, it is necessary to provide allelic or genotypic frequencies while taking into account
population structure.
Formally, :func:`.freq_from_list` expects allelic population frequencies for
an arbitrary number of clusters (at least one). The number of alleles, of
clusters, and of populations per clusters are deduced from provided data.
In practice, this requires that you provide a nested list of frequencies with
three levels: clusters, populations, individuals.

The basic structure of the input nested list is:
    * First level: ``[cluster1, cluster2, ...]``
    * Second level: ``cluster1 = [pop1, pop2, ...]``
    * Third level: ``pop1 = [p1, p2, ...]`` where ``p1`` is the absolute
      frequency of the first allele (the number of allele must match
      over all populations.

In many cases, there will be no clusters, which is actually equivalent to
a single cluster. In this case, the first level would be a list containing
a single item: ``[[pop1, pop2, ...]]``.

It is possible also that there is no population structure. Then this level
can be also bypassed by using a single-item list.

The outgroup is loaded as the second argument, and is provided as another
list that is analogous to an additional population (a list of frequencies
of all alleles, even if there is only one sample).

Actually, for simple cases, the syntax is not verbose. The above example,
assuming a single cluster and a single population, while providing allelic
frequencies only, writes as follows::

    >>> freq = egglib.freq_from_list([[[3, 3, 2, 1, 1]]], [1, 0, 0, 0, 0])

If your data are based on genotypes, you can provide genotypic frequencies.
In this case you are required to provided as well a list detailing the
allele composition of each genotypes considered. In the example above, there
are also five genotypes, but each is represented by a single copy. We need
to recode them, and use the *geno_list* option::

    >>> freq = egglib.freq_from_list([[[1, 1, 1, 1, 1]]], [1, 0, 0, 0, 0],
    ...         geno_list=[(0, 0), (0, 1), (2, 2), (3, 1), (1, 4)],
    ...         alphabet=egglib.alphabets.positive_infinite)

The allelic frequencies are computed automatically based on the composition
of genotypes as provided.

Computing diversity statistics
******************************

Using the same example as above, we can see that we can also compute
diversity statistics for a single site or from an array of sites when
only frequencies are available (obviously, statistics requiring
haplotypic information will not be available). For this we need to use
the method :meth:`~.stats.ComputeStats.process_freq` of :class:`~.stats.ComputeStats`::

    >>> cs.clear_stats()
    >>> cs.add_stats('Aing', 'He', 'R')
    >>> site = egglib.site_from_align(aln1, 66)
    >>> freq = egglib.freq_from_site(site)
    >>> print(cs.process_freq(freq))
    {'He': 0.6178217821782178, 'Aing': 3, 'R': 0.02}
    >>> cs.add_stats('D', 'Pi')
    >>> cs.configure(multi=True)
    >>> cs.set_structure(struct)
    >>> site = egglib.Site()
    >>> freq = egglib.Freq()
    >>> for i in range(aln1.ls):
    ...     site.from_align(aln1,i)
    ...     freq.from_site(site)
    ...     cs.process_freq(freq)
    >>> print(cs.results())
    {'Aing': 1.1378886155222545, 'Pi': 193.00524876495805, 'He': 0.023322406154947233, 'D': -0.4440572753610234, 'R': 0.0013951963115034285}