The stats.ComputeStats class

The stats module contains a set of tools to analyze molecular variation data, either represented as Align instances (which have been extensively described in the previous part of the manual) or Site instances (which hold information for a single locus).

Among those tools, stats.ComputeStats is the most complex and the richest in functionality. Most statistics can be computed using ComputeStats; in general, only the most complex statistics deserve their own function or class, such as some of the linkage disequilibrium statistics, which are addressed near the end of this page.

The rationale for defining ComputeStats as a class rather than a standalone function is that it allows more flexibility (the following sections offer an overview of the possibilities) and that it can also be more efficient when many items (alignments or sites) need to be processed in a row.

The basic workflow consists of the following steps, which are developed in the following paragraphs (a minimal end-to-end sketch is given after the list):

  • Create an instance of the class,

  • Set parameters,

  • Enter the list of statistics to be computed,

  • Process one or several alignments or sites.
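As a minimal sketch of this workflow (assuming a Fasta alignment file named align.fas; the chosen statistics and options are arbitrary examples):

>>> import egglib
>>> aln = egglib.io.from_fasta('align.fas', alphabet=egglib.alphabets.DNA)
>>> cs = egglib.stats.ComputeStats()   # create an instance
>>> cs.configure(multi_hits=True)      # set parameters
>>> cs.add_stats('S', 'Pi', 'D')       # enter the statistics to compute
>>> stats = cs.process_align(aln)      # process an alignment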

Options

Parameter values can be passed either as arguments to the stats.ComputeStats constructor or through the configure() method. The parameters are listed and explained in the documentation of this method. The most important is multi_hits. By default, multi_hits is False, causing all sites with more than two alleles to be excluded. The default parameters might be what you want if you use single-nucleotide polymorphism markers or nucleotide sequence alignments, but certainly not if you use highly multiallelic markers.

To pass parameters, both syntaxes are equivalent:

>>> cs = egglib.stats.ComputeStats()
>>> cs.configure(multi_hits=True)

and

>>> cs = egglib.stats.ComputeStats(multi_hits=True)

Note that it is also possible to call configure() later, at any point in the workflow. In general, try as much as possible to reuse objects when you perform intensive computations.

List of statistics

Statistics can be specified using the instance method add_stats(). This method can be called several times in a row, at any time (even after analysing data sets), and several statistics can be entered at once, as shown in the example below:

>>> cs.add_stats('S', 'Pi')
>>> cs.add_stats('D')

Statistics are identified by a code (a str). A list of tuples matching each code with a short description can be obtained through the instance method list_stats(). An exhaustive description is available in Diversity statistics.
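For instance, a minimal sketch to display the available codes with their descriptions (the exact list depends on the EggLib version):

>>> for code, description in cs.list_stats():
...     print(code, '-', description)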

There is a method to add all possible statistics (all_stats()), but note that only those that can actually be computed (depending on the structure of the provided data set) will be computed; the others will be set to None.

Note

Even if a statistic can be computed, it does not mean that it should. When analysing data, it can be counter-productive to generate huge tables of statistics among which some (many) may have too much variance to be interpreted anyway.

You can reset the list of statistics using:

>>> cs.clear_stats()

If you are happy with the default values of the parameters, you are not required to call configure(). However, it is necessary to enter the statistics to be computed, so you do have to call add_stats(). Still, it is not an error to skip this step and proceed with analysing data (but in that case you will not get any results).

Processing data

Once the needed parameters have been set and the needed statistics have been entered, the user can analyze data using the following methods of the stats.ComputeStats instance:

  • process_align() for an Align instance,

  • process_site() for a single Site,

  • process_sites() for a list of Site instances,

  • process_freq() for a Freq instance (allelic frequencies).

These methods are addressed in more detail in the following sections. By default, they all return a dict containing the computed values of the statistics for the data provided as argument. It is also possible to call them several times and access the results of the analysis of all passed data afterwards (see Multiple alignments).

Computing statistics

Single alignment

Computing statistics from a single alignment should be rather straightforward. Assume we have a Fasta file containing a nucleotide alignment. We show below how to compute a set of standard statistics used in the case of nucleotide sequences from this alignment:

>>> aln1 = egglib.io.from_fasta('align1.fas', labels=True, alphabet = egglib.alphabets.DNA)
>>> struct = egglib.struct_from_labels(aln1, lvl_pop=0, lvl_indiv=1)
>>> cs = egglib.stats.ComputeStats()
>>> cs.set_structure(struct)
>>> cs.add_stats('S', 'thetaW', 'Pi', 'D', 'lseff', 'nseff')
>>> stats = cs.process_align(aln1)
>>> print(aln1.ns, aln1.ls)
101 8942
>>> print(stats)
{'lseff': 3288, 'S': 305, 'thetaW': 59.02529109000289, 'nseff': 99.0, 'Pi': 67.56256441970726, 'D': 0.48870487478046226}

Note that, in order to exclude the outgroup sequences from the analysis, we have to import labels and generate an appropriate Structure instance using the function struct_from_labels(). The management of structure is described in the next section. This structure is then passed to the stats.ComputeStats instance using the method set_structure(). As visible in the statistics list, many more statistics are available, but most of them would be irrelevant for this kind of data (they would be reported as None because they simply cannot be computed). We first printed the ns (number of provided samples) and ls (alignment length) properties of the alignment: there are 101 samples and 8942 sites.

  • S is the number of polymorphic sites. Here we have 305 variable sites.

  • thetaW (\(\hat{\theta}_W\)) is the so-called Watterson’s estimator of \(\theta = 4N_e\mu\). The value is close to 60, which corresponds to about 0.018 per analysed site (see below).

  • Pi (\(\pi\)) is the nucleotide diversity, which is a bit larger (about 0.021 per analysed site).

  • D (Tajima’s \(D\)) is positive (+0.49), which follows directly from Pi being larger than thetaW.

  • lseff is the number of sites used for analysis (excluding those with either too many missing data or too many alleles). Here the number is 3288, meaning that 5654 sites have been excluded (because the default is to exclude all sites with any missing data).

  • nseff is the average number of used samples among included sites. In our case, since only sites with no missing data at all have been used, the number of samples is 99 because the last two sequences are from the outgroup, but this value can be smaller if the parameter max_missing is larger than 0 (see example in Missing data).

Note

thetaW and Pi are given per gene (that is, they are summed over the total number of analyzed sites). To be expressed per site, they must be divided by lseff.
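For instance, with the dictionary obtained above (a small sketch; values rounded in the comments):

>>> theta_per_site = stats['thetaW'] / stats['lseff']   # about 0.018
>>> pi_per_site = stats['Pi'] / stats['lseff']          # about 0.021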

Multiple alignments

If you have several separate alignments and want to obtain global statistics, the intuitive approach would be to concatenate the alignments and pass the result to process_align(). There is a much more efficient way (especially if many alignments are involved), described below.

There is a multi option which, if set, tells ComputeStats to accumulate data and wait for more before computing statistics. In that case, you must call results() to actually compute and retrieve the statistics:

>>> alnA = aln1.extract(0, 4500)
>>> alnB = aln1.extract(4500, None)
>>> cs.configure(multi=True)
>>> cs.process_align(alnA)
>>> cs.process_align(alnB)
>>> stats = cs.results()
>>> print(stats)
{'lseff': 3297, 'S': 314, 'thetaW': 60.76702099101937, 'nseff': 99.0, 'Pi': 71.44588744588735, 'D': 0.5939904839741603}

The above example simply cuts the original alignment into two parts and then processes the two fragments separately. The final dictionary of statistics aggregates both fragments as if they formed a single data set, and is comparable to the one obtained from the full alignment.

Phased data

Note that some statistics require that data are phased. That is the case for all statistics based on haplotypes and linkage disequilibrium. If multiple alignments are loaded, there is no guarantee that the data are phased, even if the number of samples matches. Make sure that your data really are phased if you want to compute these statistics; by default, EggLib will calculate them whenever the data allow it.
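If your fragments are not actually phased, these statistics can be turned off; a minimal sketch, assuming phased is accepted as a regular option of configure() (as the section on static lists of sites below suggests):

>>> cs.configure(multi=True, phased=False)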

Using individual sites

Individual sites can be processed as well. They are represented by the class Site. This class is designed to represent data for any genetic marker, such as a single nucleotide polymorphism (SNP), a microsatellite, an encoded insertion/deletion polymorphism, or any other form of genetic variation properly encoded using an appropriate alphabets.Alphabet instance. The following functions can be used to create a Site:

  • site_from_align(): a position in an Align instance.

  • site_from_list(): a user-provided list of data.

  • site_from_vcf(): the current data of a VCF parser (see Using VCF files).

Note that each of these functions has a counterpart as a Site method, allowing an existing object to be recycled if one is already available.

In the examples of the following paragraphs, we will create sites from the same alignment that we have been using in previous examples. In practice, however, the class Site is mostly intended for cases where only individual sites are available. The small example below shows how to create a Site from a list of allelic values, which is the simplest and most intuitive way:

>>> site = egglib.site_from_list(['C', 'G', 'G', 'C', 'T', 'T', 'G', 'T', 'G', 'G', 'G', 'G'], alphabet=egglib.alphabets.DNA)

Note

When analysing individual sites, it is frequent that more than two alleles are present (except for canonical SNP sites). It can also happen with sequence alignments. If this is the case and you wish to include such sites in the analysis, don’t forget to set the ComputeStats parameter multi_hits to True.
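For example, reusing the syntax shown in the Options section:

>>> cs.configure(multi_hits=True)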

Single site statistics

The code in the next example clears the list of statistics, specifies a set better suited to single-site analysis, and then analyses the site at position 66:

>>> cs.clear_stats()
>>> cs.configure(multi=False)
>>> cs.add_stats('Aing', 'He', 'R')
>>> site = egglib.site_from_align(aln1, 66)
>>> stats = cs.process_site(site)
>>> print(stats)
{'R': 0.02, 'Aing': 3, 'He': 0.6178217821782178}

The statistics computed here are:

  • Aing: the number of alleles in the ingroup (this is a relatively infrequent case of a SNP with three alleles within a nucleotide alignment).

  • R: the allelic richness.

  • He: the heterozygosity (which is above 0.5 only because there are three alleles).

Multiple sites statistics

You might also be interested in statistics over several sites. You can load multiple sites in a similar way as for alignments. In the example below we add the per-gene statistics D and Pi (in addition to Aing, R and He) and process every site of the alignment in turn:

>>> cs.add_stats('D', 'Pi')
>>> for i in range(aln1.ls):
...     site = egglib.site_from_align(aln1,i)
...     stats = cs.process_site(site)
...     print(stats)
{'Pi': None, 'R': 0.0, 'Aing': 1, 'D': None, 'He': 0.0}
{'Pi': None, 'R': 0.0, 'Aing': 1, 'D': None, 'He': 0.0}
...

Here we only show the first two sites of the alignment, which happen to be fixed, explaining why some statistics are not calculated. Note, however, that with this syntax a Site instance is created at each iteration. A more efficient way to achieve the same result (especially when there is a large number of sites to iterate over) is to recycle the same instance over and over again, using the method from_align() of the class itself:

>>> cs.add_stats('D', 'Pi')
>>> site = egglib.Site()
>>> for i in range(aln1.ls):
...     site.from_align(aln1,i)
...     stats=cs.process_site(site)
...     print(stats)
{'Pi': None, 'R': 0.0, 'Aing': 1, 'D': None, 'He': 0.0}
{'Pi': None, 'R': 0.0, 'Aing': 1, 'D': None, 'He': 0.0}
...

Note that there is also a process_sites() method to process a list of sites in one call.

Note

Computing per-gene statistics (such as D) from sites extracted from an Align does not necessarily make much sense compared with using process_align() directly. However, process_site() is useful when sites are obtained by other means than from an alignment.

Static list of sites

Linkage disequilibrium statistics, as well as \(\bar{r}_d\) (code rD), require that all sites used for the analysis are available at the same time. These statistics cannot be computed by the method results() of stats.ComputeStats if the data have been provided by process_site() or process_sites(), even if phased is set to True. The examples below demonstrate this with the ZnS statistic, which can be computed if we pass an Align:

>>> cs.clear_stats()
>>> cs.add_stats('ZnS')
>>> print(cs.process_align(aln1))
{'ZnS': 0.17236275214525582}

They can also be computed if we pass several fragments, but only if the fragments have the same number of samples and we guarantee that the lists of samples match (i.e. the data are phased):

>>> alnA = aln1.extract(0, 4500)
>>> alnB = aln1.extract(4500, None)
>>> cs.configure(multi=True)
>>> cs.set_structure(struct)
>>> cs.process_align(alnA)
>>> cs.process_align(alnB)
>>> print(cs.results())
{'ZnS': 0.17236275214525582}

To test what happens when we provide the sites individually, we extract all sites of the alignment (skipping those with missing data to match what process_align() does by default):

>>> sites = []
>>> for i in range(aln1.ls):
...     site = egglib.site_from_align(aln1, i)
...     if site.num_missing == 0:
...         sites.append(site)
...

The statistics in question are not computed if we provide the sites individually, one by one, regardless of whether they are phased or not, because nothing guarantees that the site objects will remain unchanged until the statistics are actually computed:

>>> for site in sites:
...     cs.process_site(site)
...
>>> print(cs.results())
{'ZnS': None}

In that case, it is necessary to pass all sites grouped together in one list:

>>> print(cs.process_sites(sites))
{'ZnS': 0.17067362457914195}

Stand-alone allelic frequencies

It can happen that only allelic frequencies are available (such as with bulk sequencing). In that case, Site is not appropriate because it requires an ordering of samples. It would be possible to artificially create a site from a set of allelic frequencies, but this would be a pointless waste of computing resources. EggLib provides a class, named Freq, fulfilling this aim.

There is a logical relationship between the classes Align, Site, and Freq: each of the latter two can be created from an instance of the preceding class, but they can also be populated by external means. In other words, Align and even Site can be bypassed if the corresponding data are not available. We have seen in the previous section that it is not necessary to artificially create an Align instance if only individual sites are available. Likewise, Site can be bypassed if only frequencies are available.

Like Site, Freq instances can be created from different types of source: from a Site instance, from a user-specified list, or from VCF data.

  • freq_from_site(): a Site instance.

  • freq_from_list(): a user-provided list of data.

  • freq_from_vcf(): the current data of a VCF parser (see Using VCF files).

There are also equivalent methods to recycle an existing instance.

Creation of Freq instances from a site

If we go back to the example site created above, we see that creating a Freq instance is rather straightforward (second line below):

 site = egglib.site_from_list(['C', 'G', 'G', 'C', 'T', 'T', 'G', 'T', 'G', 'G', 'G', 'G'], alphabet=egglib.alphabets.DNA)
 freq = egglib.freq_from_site(site)
 for i in range(freq.num_alleles):
     print(freq.freq_allele(i))

Creation of Freq instances from user-provided frequency data

To create a Freq instance from already computed allelic frequencies, the syntax is logical but the data format must be followed carefully. Here, it is necessary to provide allelic or genotypic frequencies while taking population structure into account. Formally, freq_from_list() expects allelic population frequencies for an arbitrary number of clusters (at least one). The numbers of alleles, of clusters, and of populations per cluster are deduced from the provided data. In practice, this requires that you provide a nested list of frequencies with three levels: clusters, populations, and allele frequencies.

The basic structure of the input nested list is:
  • First level: [cluster1, cluster2, ...]

  • Second level: cluster1 = [pop1, pop2, ...]

  • Third level: pop1 = [p1, p2, ...] where p1 is the absolute frequency of the first allele (the number of alleles must match over all populations).

In many cases, there will be no clusters, which is actually equivalent to a single cluster. In this case, the first level would be a list containing a single item: [[pop1, pop2, ...]].

It is also possible that there is no population structure. In that case, this level can likewise be bypassed by using a single-item list.

The outgroup is loaded as the second argument, and is provided as another list that is analogous to an additional population (a list of frequencies of all alleles, even if there is only one sample).

In practice, for simple cases, the syntax is not verbose. The above example, assuming a single cluster and a single population, and providing allelic frequencies only, is written as follows:

freq = egglib.freq_from_list([[[3, 3, 2, 1, 1]]], [1, 0, 0, 0, 0])

If your data are based on genotypes, you can provide genotypic frequencies. In this case you are also required to provide a list detailing the allele composition of each genotype considered. In the example above, there are also five genotypes, but each is represented by a single copy. We need to recode them and use the geno_list option:

freq = egglib.freq_from_list([[[1, 1, 1, 1, 1]]], [1, 0, 0, 0, 0], geno_list=[(0, 0), (0, 1), (2, 2), (3, 1), (1, 4)], alphabet=egglib.alphabets.positive_infinite)

The allelic frequencies are computed automatically based on the composition of the genotypes as provided: here, the five genotypes (0, 0), (0, 1), (2, 2), (3, 1) and (1, 4), each counted once, yield the allele counts 3, 3, 2, 1 and 1, as in the purely allelic example above.

Computing diversity statistics

Using the same example as above, we can also compute diversity statistics for a single site, or over a set of sites, when only frequencies are available (obviously, statistics requiring haplotypic information will not be available). For this we use the method process_freq() of stats.ComputeStats:

>>> cs.clear_stats()
>>> cs.add_stats('Aing', 'He', 'R')
>>> site = egglib.site_from_align(aln1, 66)
>>> freq = egglib.freq_from_site(site)
>>> print(cs.process_freq(freq))
{'R': 0.02, 'Aing': 3, 'He': 0.6178217821782178}

>>> cs.add_stats('D', 'Pi')
>>> cs.configure(multi=True)
>>> cs.set_structure(struct)
>>> site = egglib.Site()
>>> freq = egglib.Freq()
>>> for i in range(aln1.ls):
...     site.from_align(aln1, i)
...     freq.from_site(site)
...     cs.process_freq(freq)
...
>>> print(cs.results())
{'Aing': 1.1378886155222545, 'Pi': 193.00524876495805, 'He': 0.023322406154947233, 'D': -0.4440572753610234, 'R': 0.0013951963115034285}