The stats.ComputeStats class

The stats module contains a set of tools to analyze molecular variation data, either represented as Align instances (which have been extensively described in the previous part of the manual) or Site instances (which hold information for a single locus).

Among those tools, stats.ComputeStats is the more complex, and the more rich with functionalities. Most of the statistics can be computed using this class (in general, only the most complex statistics deserve their own function or class, such as part of linkage disequilibrium statistics, which are addressed near the end of this page).

The rationale for defining ComputeStats as a class rather than a standalone function is that it allows more flexibility (the following sections offer an overview of the possibilities) and that it can also be more efficient if a lot of items (alignments or sites) need to be processed in a row.

The basic worflow consists in the following steps, which are developped in the following paragraphs:

  1. Create an instance of the class,

  2. Set parameters,

  3. Enter the list of statistics to be computed,

  4. Process one or several alignments or sites.

Options

It is possible to pass parameters values either as arguments of stats.ComputeStats instances, or through their configure() method. The parameters are listed and explained in the documentation of this method. The most important is multi_hits. By default, multi_hits is False, causing all sites that have more than two alleles to be excluded as well. The default parameters might be what you want if you use single-nucleotide polymorphism markers or nucleotide sequence alignments, but certainly not if you use massively multiallelic markers.

To pass parameters, both syntaxes are equivalent:

>>> cs = egglib.stats.ComputeStats()
>>> cs.configure(multi_hits=True)

and

>>> cs = egglib.stats.ComputeStats(multi_hits=True)

Note also that it is also possible to call configure() later at any moment in the workflow. In general, it is recommended to reuse as much as possible objects whenever you perform intensive computations.

List of statistics

Statistics can be specified using the instance method add_stats(). It is possible to call this method several times in a row, it can be called at any time even after analysing data sets, and it is also possible to enter several statistics at once, as shown in the example below:

>>> cs.add_stats('S', 'Pi')
>>> cs.add_stats('D')

Statistics are represented by a code (as a str). A list of tuples matching each code with a short description can be obtained through the instance method list_stats(). A exhaustive notice is availabled in Diversity statistics.

>>> for stat, descr in cs.list_stats():
...     print(stat + ': ' + descr)
ns_site: Number of analyzed samples per site
ns_site_o: Number of analyzed outgroup samples per site
Aing: Number of alleles in ingroup
Aotg: Number of alleles in outgroup
Atot: Number of alleles in whole dataset
As: Number of singleton alleles
Asd: Number of singleton alleles (derived)
R: Allelic richness
thetaIAM: Theta estimator based on He & IAM model
thetaSMM: Theta estimator based on He & SMM model
[...skipped...]

There is a method to add all possible statistics (all_stats()), but note that, obviously, only those that can be computed (depending on the structure of the provided data set) will be computed (others will be set to None).

Note

Even if a statistic can be computed, it does not mean that it should. When analysing data, it can be counter-productive to generate huge tables of statistics among which some (many) may have too much variance to be interpreted anyway.

You can reset the list of statistics using:

>>> cs.clear_stats()

If you are happy with default values of parameters, you are not required to call configure(). However, it is necessary that you enter parameters to be computed, so you have to call add_stats(). Still, it is not an error to skip this step and proceed with analysing data (but in that case you will not get any results).

Processing data

Once needed parameters have been set and needed statistics have been entered, the user can analyze data using the following methods of the ComputeStats instance:

These methods are addressed in more details in the following section. By default, they all return a dict containing computed values of statistics corresponding to the analysis of data provided as argument. It is also possible to call them several times and access the results of the analysis of all passed data eventually (see Multiple alignments).

Computing statistics

Single alignment

Computing statistics from a single alignment should be rather straightforward. Assume we have a Fasta file containing a nucleotide alignment. We show below how we would compute a set of standard statistics used in the case of nucleotide sequences from this alignment:

>>> aln1 = egglib.io.from_fasta('align1.fas', labels=True, alphabet = egglib.alphabets.DNA)
>>> struct = egglib.struct_from_labels(aln1, lvl_pop=0, lvl_indiv=1)
>>> cs = egglib.stats.ComputeStats()
>>> cs.set_structure(struct)
>>> cs.add_stats('S', 'thetaW', 'Pi', 'D', 'lseff', 'nseff')
>>> stats = cs.process_align(aln1)
>>> print(aln1.ns, aln1.ls)
101 8942
>>> print(stats)
{'lseff': 3288, 'S': 305, 'thetaW': 59.02529109000289, 'nseff': 99.0, 'Pi': 67.56256441970726, 'D': 0.48870487478046226}

Note that in order to exclude the outgroup sequences from the analysis, we have to import labels and generate an appropriate Structure instance using the method struct_from_labels(). The management of structures is described in the next section. This structure is then imported in the ComputeStats instance using the method set_structure(). As visible in the statistics list, the number of available statistics is much larger than that, but many will be irrelevant for this kind of data (most of them being reported as None because they just can’t be computed). We first printed the ns (number of provided samples) and ls (alignment length) properties of the alignment: there are 101 samples and 8942 sites.

  • S is the number of polymorphic sites. Here we have 305 variable sites.

  • thetaW (\(\hat{\theta}_W\)) is the so-called Watterson’s estimator of \(\theta = 4N_e\mu\). The value is close to 60 which approaches 0.09 per analysed site (see below).

  • Pi (\(\pi\)) is the nucleotide diversity, which is a bit larger.

  • D (Tajima’s \(D\)) is positive (+0.49), which is a corollary of Pi being larger than thetaW.

  • lseff is the number of sites used for analysis (excluding those with either too many missing data or too many alleles). Here the number is 3288, meaning that 5654 sites have been excluded (mostly because the default is to exclude all sites with any missing data).

  • nseff is the average number of used samples among included sites. In our case, since only sites with no missing data at all have been used, the number of samples is 99 because the last two sequences are from the outgroup, but this value can be smaller if the parameter max_missing is larger than 0 (see example in Missing data).

Note

thetaW and Pi are given per gene (that is, they are summed over the total number of analyzed sites). To be expressed per site, they must be divided by the number of sites available for analysis. If all sites have been provided, this value is given by the statistic lseff. But lseff is relevant if only polymorphic or pre-filtered sites have been provided.

Multiple alignments

If you have several separate alignments but you want to obtain global statistics, the instinctive approach would be to concatenate the alignments and use the result in process_align(). There is a much more efficient way (assuming you want to do so with many alignments), described below.

There is a multi option which, if toggled, set ComputeStats to load data in several batches, perform intermediate computations but delay finalisation of statistics computations until all data have been loaded. In that case, you must call results() to actually compute and get statistics:

>>> alnA = aln1.extract(0, 4500)
>>> alnB = aln1.extract(4500, None)
>>> cs.configure(multi=True, struct=struct)
>>> cs.process_align(alnA)
>>> cs.process_align(alnB)
>>> stats = cs.results()
>>> print(stats)
{'thetaW': 59.02529109000289, 'D': 0.48870487478046226, 'nseff': 99.0, 'lseff': 3288, 'Pi': 67.56256441970726, 'S': 305}

The above example just cut the original alignment in two parts, and then processes the two alignments separately. The final dictionary of statistics, as expected, is exactly identical to the one obtained with the full alignment (n.b. even if it was specified before, it is necessary to pass again the structure, because the call to configure() resets all parameters that are not specified).

Phased data

Note that some statistics require that data are phased. That’s the case of all statistics based on haplotype and linkage disequilibrium. If multiple alignments are loaded, it is not ensured that data are phased, even if the number of samples matches. Make sure that your data are REALLY phased if you want to compute these statistics. By default EggLib will calculate them if the data allow it. Note also that all statistics requiring phase entail significantly longer computations when applied to large number of polymorphic sites. These stastistics are also hardly relevant over large genomic regions.

Using individuals sites

Individual sites can be processed as well. They are represented by the class Site. This class is aimed to represent data for any genetic marker such as a single nucleotide polymorphism (SNP), a microsatellite, an encoded insertion/deletion polymorphism, or any other form of genetic variation properly encoded using an appropriate Alphabet instance. The following functions allow to create a Site:

Generator functions

Source of data

site_from_align()

A position in an Align instance.

site_from_list()

A user-provided list of data.

site_from_vcf()

The current data of a VCF parser (see Using VCF files).

Note that each of those functions has a counterpart as a Site method to allow to recycle an existing object.

In the examples of the following paragraphs, we will create sites from the same alignment that we have been using in previous examples. However, in practice the class Site is mostly there for cases when individual sites are available. The small example below shows how to create a Site from a list of allelic values, which is the most simple and intuitive way:

>>> site = egglib.site_from_list(
...     ['C', 'G', 'G', 'C', 'T', 'T', 'G', 'T', 'G', 'G', 'G', 'G'],
...     alphabet=egglib.alphabets.DNA)

Note

When analysing individual sites, it is frequent that more than two alleles are present (except for canonical SNP sites). It can also happen with sequence alignments. If this is the case and you wish to include such sites in the analysis, don’t forget to set the ComputeStats parameter multi_hits to True.

Single site statistics

The code in the next example will clear the list of statistics and specify a list more adapted to single-site analysis, and then will analyse the site at position 66 (which is the 67th site):

>>> cs.clear_stats()
>>> cs.configure(multi=False, struct=struct)
>>> cs.add_stats('Aing', 'He', 'R')
>>> site = egglib.site_from_align(aln1, 66)
>>> stats = cs.process_site(site)
>>> print(stats)
{'Aing': 3, 'R': 0.02040816326530612, 'He': 0.6233766233766234}

The statistics computed here are:

  • Aing: the number of alleles in ingroup (this is a relatively unfrequent case with a SNP with three alleles within a nucleotide alignment).

  • R: the allelic richness.

  • He: the heterozygosity (which is above 0.5 only because there are three alleles).

Multiple sites statistics

You might be also interested in statistics over several sites. You can load multiple sites in a similar way as for alignments. In the example below we compute, in addition to Aing, R and He, the per-gene statistic D over all sites of the alignment, but we keep on computing statistics on a per-site basis:

>>> cs.add_stats('D', 'Pi')
>>> site = egglib.Site()
>>> for i in range(aln1.ls):
...     site.from_align(aln1,i)
...     stats = cs.process_site(site)
...     print(stats)
{'R': 0.0, 'He': 0.0, 'Pi': None, 'D': None, 'Aing': 1}
{'R': 0.0, 'He': 0.0, 'Pi': None, 'D': None, 'Aing': 1}
[...skipped...]
{'R': 0.010416666666666666, 'Pi': None, 'D': None, 'He': 0.3762886597938144, 'Aing': 2}
{'R': 0.01020408163265306, 'Pi': None, 'D': None, 'He': 0.37105751391465686, 'Aing': 2}
{'R': 0.010309278350515464, 'Pi': None, 'D': None, 'He': 0.3736587418472543, 'Aing': 2}
[...skipped...]

In this example, we recycled a unique Site instance using its method from_align() instead of creating a new instance at each iteration step, which is a good practice for performance reasons in this precise sitution. Note that there is also a process_sites() method to process a list of sites in one call. process_sites() is more appropriate to process dynamically generated arrays of sites as in this case.

Pi and D are not computed because they are not defined for individual sites. What we really want, in this example, is to replicate the analysis performed with process_align() to demonstrate how it would be done if we had only individual sites in the first place. This can be achieved by means of the multi argument:

>>> cs.configure(multi=False, struct=struct)
>>> for i in range(aln1.ls):
...     site.from_align(aln1,i)
...     stats = cs.process_site(site)
>>> print(cs.results())
{'He': 0.02184674638703229, 'Aing': 1.098813786929275, 'R': 0.0010210126233630454, 'Pi': 185.90085077948126, 'D': 0.47716739584524404}

Static list of sites

Linkage disequilibrium statistics, as well as \(\bar{r}_d\) (code rD), require that all sites used for analysis are available at the time of final computation. Those statistics cannot be computed by the method results() of ComputeStats if the data have been provided by process_site() or process_sites(). The examples below demonstrates it with the case of the ZnS statistic. Such statistic can be computed if we pass an Align:

>>> cs.clear_stats()
>>> cs.add_stats('ZnS')
>>> print(cs.process_align(aln1))
{'ZnS': 0.17236275214525582}

They can be as well if we pass several fragments, but only if they have the same number of samples (it is implied that the list of samples is matching, that is that data are phased):

>>> alnA = aln1.extract(0, 4500)
>>> alnB = aln1.extract(4500, None)
>>> cs.configure(multi=True)
>>> cs.set_structure(struct)
>>> cs.process_align(alnA)
>>> cs.process_align(alnB)
>>> print(cs.results())
{'ZnS': 0.17236275214525582}

To test what happens when we provide the sites individually, we extract all sites of the alignment. However, process_align() excludes by default all sites with any missing data in the ingroup while process_site() considers all sites which are provided. So we need to filter the sites ourselves, to only include those that have no missing data, hence 99 available samples, using the class Freq:

>>> sites = []
>>> frq = egglib.Freq()
>>> for i in range(aln1.ls):
...     site = egglib.site_from_align(aln1, i)
...     frq.from_site(site, struct)
...     if frq.nseff(frq.ingroup) == 99:
...         sites.append(site)

The statistics in question are not computed if we provide the sites individually and one by one, regardless of whether they are phased or not, because nothing guarantees that the site objects will be constant until statistics are actually computed:

>>> for site in sites:
...     cs.process_site(site)
>>> print(cs.results())
{'ZnS': None}

In that case, it is necessary to pass all sites grouped together in one list:

>>> print(cs.process_sites(sites))
{'ZnS': 0.17236275214525582}

Stand-alone allelic frequencies

It can happen that only allelic frequencies are available (such as with bulk sequencing). In that case Site is not appropriate because it requires an ordering of samples. It would be possible to arbitrarily create a site from a set of allelic frequencies but this would be a pointless waste of computing resources. There is a class in EggLib, named Freq, addressing this case.

There is a logical relationship between the classes Align, Site, and Freq: the latter two can be created based on instances of the respective previous one, but they can also be provided by external means. In other words, Align and even Site can be bypassed if the corresponding data are not available. We have seen in the previous section that it is not necessary to artifically create an Align instance if only available sites are available. Likewise, Site can be bypassed if only frequencies are available.

Like Site, Freq instances can be created from different types of source: from a Site instance, from a user-specified list, or from VCF data.

Generator functions

Source of data

freq_from_site()

A Site instance.

freq_from_list()

A user-provided list of data.

freq_from_vcf()

The current data of a VCF parser (see Using VCF files).

There are also equivalent methods to recycle an existing instance.

Creation of Freq instances from a site

If we go back to the example site created above, we see that creating a Freq instance is rather straighforward (see highlighted line below):

>>> site = egglib.site_from_list(
...     ['C', 'G', 'G', 'C', 'T', 'T', 'G', 'T', 'G', 'G', 'G', 'G'],
...     alphabet=egglib.alphabets.DNA)
>>> freq = egglib.freq_from_site(site)
>>> for i in range(freq.num_alleles):
...     print(freq.allele(i), freq.freq_allele(i))
C 2
G 7
T 3

Creation of Freq instances from user-provided frequency data

To create a Freq instance from already computed allelic frequencies, the syntax is logical but the data format must be followed carefully. Here, it is necessary to provide allelic or genotypic frequencies while taking into account population structure. Formally, freq_from_list() expects allelic population frequencies for an arbitrary number of clusters (at least one). The number of alleles, of clusters, and of populations per clusters are deduced from provided data. In practice, this requires that you provide a nested list of frequencies with three levels: clusters, populations, individuals.

The basic structure of the input nested list is:
  • First level: [cluster1, cluster2, ...]

  • Second level: cluster1 = [pop1, pop2, ...]

  • Third level: pop1 = [p1, p2, ...] where p1 is the absolute frequency of the first allele (the number of allele must match over all populations.

In many cases, there will be no clusters, which is actually equivalent to a single cluster. In this case, the first level would be a list containing a single item: [[pop1, pop2, ...]].

It is possible also that there is no population structure. Then this level can be also bypassed by using a single-item list.

The outgroup is loaded as the second argument, and is provided as another list that is analogous to an additional population (a list of frequencies of all alleles, even if there is only one sample).

Actually, for simple cases, the syntax is not verbose. The above example, assuming a single cluster and a single population, while providing allelic frequencies only, writes as follows:

>>> freq = egglib.freq_from_list([[[3, 3, 2, 1, 1]]], [1, 0, 0, 0, 0])

If your data are based on genotypes, you can provide genotypic frequencies. In this case you are required to provided as well a list detailing the allele composition of each genotypes considered. In the example above, there are also five genotypes, but each is represented by a single copy. We need to recode them, and use the geno_list option:

>>> freq = egglib.freq_from_list([[[1, 1, 1, 1, 1]]], [1, 0, 0, 0, 0],
...         geno_list=[(0, 0), (0, 1), (2, 2), (3, 1), (1, 4)],
...         alphabet=egglib.alphabets.positive_infinite)

The allelic frequencies are computed automatically based on the composition of genotypes as provided.

Computing diversity statistics

Using the same example as above, we can see that we can also compute diversity statistics for a single site or from an array of sites when only frequencies are available (obviously, statistics requiring haplotypic information will not be available). For this we need to use the method process_freq() of ComputeStats:

>>> cs.clear_stats()
>>> cs.add_stats('Aing', 'He', 'R')
>>> site = egglib.site_from_align(aln1, 66)
>>> freq = egglib.freq_from_site(site)
>>> print(cs.process_freq(freq))
{'He': 0.6178217821782178, 'Aing': 3, 'R': 0.02}
>>> cs.add_stats('D', 'Pi')
>>> cs.configure(multi=True)
>>> cs.set_structure(struct)
>>> site = egglib.Site()
>>> freq = egglib.Freq()
>>> for i in range(aln1.ls):
...     site.from_align(aln1,i)
...     freq.from_site(site)
...     cs.process_freq(freq)
>>> print(cs.results())
{'Aing': 1.1378886155222545, 'Pi': 193.00524876495805, 'He': 0.023322406154947233, 'D': -0.4440572753610234, 'R': 0.0013951963115034285}